Preface


Surveys and samples sometimes seem to surround you. Many give valuable information;
some, unfortunately, are so poorly conceived and implemented that it would be
better for science and society if they were simply not done. This book concentrates on 
the statistical aspects of taking and analyzing a sample. It gives you guidance on how 
to tell when a sample is valid or not, and how to design and analyze many different 
forms of sample surveys. 

Much research has been done on theoretical and applied aspects of survey sam- 
pling since the publication of the first edition of this book. The second edition incor- 
porates some of this recent research, contains new topics such as total survey design 
and statistical issues in Internet surveys, and expands coverage of weighting, cali- 
bration, two-phase sampling, and sampling for rare events. The order of topics has 
been streamlined to be more intuitive, chapter summaries have been added for quick 
review, and exercise sets are now better categorized by problem type. SAS® software 
is now used for calculations, with downloadable SAS code provided on the book’s 
companion website. 

Six main features distinguish this book from other texts about sampling methods. 


• The book is accessible to students with a wide range of statistical backgrounds,
and is flexible for content and level. By appropriate choice of sections, this book 
can be used for a first-year graduate course for statistics students or for a class 
with students from business, sociology, psychology, or biology who want to learn 
about designing and analyzing data from sample surveys. It is also useful for a 
person doing survey research who wants to learn more about the statistical aspects 
of surveys and recent developments. 


• I have tried to use real data as much as possible: the Acme Widget Company never
appears in this book. The examples and exercises come from social sciences, engi- 
neering, agriculture, ecology, medicine, and a variety of other disciplines, and are 
selected to illustrate the wide applicability of sampling methods. A number of 
data sets have extra variables not specifically referenced in the text; an instruc- 
tor can use these for additional exercises or variations. The exercises also give 
the instructor much flexibility for course level. Some emphasize mastering the 
mechanics, but many encourage the student to think about the sampling issues 
involved and to understand the structure of sample designs at a deeper level, while 
others are open-ended and encourage further exploration of the ideas. 


• I have incorporated model-based as well as randomization-based theory into the
text, with the goal of placing sampling methods within the framework used in 
other areas of statistics. Many of the important results in the last twenty-plus 
years of sampling research have involved models, and an understanding of both 
approaches is essential for the survey practitioner. The model-based approach is 
introduced in Section 2.9 and further developed in successive chapters; however, 
those sections could be discussed at any time later in the course. 

• The book covers many topics not found in other textbooks at this level. Chapters
7 through 15 discuss how to analyze complex surveys such as those administered 
by the United States Census Bureau or Statistics Canada, computer-intensive 
methods for estimating variances in complex surveys, what to do if there is nonre- 
sponse, and how to perform chi-squared tests and regression analyses using data 
from complex surveys. 


• This book emphasizes the importance of graphing the data. Graphical analysis of
survey data is challenging because of the large sizes and complexity of survey
data sets, but graphs can provide insight into the data structure.


• Design of surveys is emphasized throughout, and is related to methods for analyzing
the data from a survey. The book presents the philosophy that the design is by
far the most important aspect of any survey: no amount of statistical analysis can 
compensate for a badly designed survey. Models are used to motivate designs, and 
graphs are presented to check the sensitivity of the design to model assumptions. 


Chapters 1 through 6 cover the building blocks of simple random, stratified, and cluster
sampling, as well as ratio and regression estimation. To read them requires familiarity 
with basic ideas of expectation, sampling distributions, confidence intervals, and linear 
regression—material covered in most introductory statistics classes. Optional sections 
on the statistical theory for designs are marked with asterisks—these require you to be 
familiar with calculus and mathematical statistics. Along with Chapters 7 and 8, these 
chapters form the foundation of a one-quarter or one-semester course. The material 
in Chapters 9 through 15 can be covered in almost any order, with topics chosen to 
fit the needs of the students. Appendix A reviews probability concepts. 

The second edition introduces some organizational changes to the chapters. The 
central concept of sampling weights is now introduced in Chapter 2. Stratified sam- 
pling has been moved earlier to Chapter 3, preceding ratio and regression estimation. 
This allows students to become more familiar with the use of weights to account for 
inclusion probabilities before they are exposed to adjusting the weights for calibration. 
Chapter 6 contains more intuition and theory on the Horvitz-Thompson estimator, 
and Chapter 7 provides additional methods for graphing survey data. Chapter 9 has 
expanded treatment of computer-intensive methods such as jackknife and bootstrap. 
Material in Chapter 12 of the first edition has been expanded in Chapters 12 to 14 
of the second edition. Chapter 15 on total survey design is completely new, and ties 
together much of the material in the earlier chapters. Each chapter now concludes 
with a chapter summary, including key terms and references for further exploration. 

The exercises in the second edition have been reordered into four categories in each 
chapter, with many new exercises added to the book’s already extensive problem sets. 


• Introductory Exercises give more routine problems intended to develop skills with
the basic ideas in the book. Many of these would be suitable for hand calculations.


• Working with Survey Data exercises ask students to analyze data from real surveys.
Most require use of statistical software such as SAS. Data sets and SAS code for 
dealing with special problems in reading the data are available for download from 
the book’s companion website. 


• Working with Theory exercises are intended for a more mathematically oriented
class, allowing students to work through proofs of results in a step-by-step manner
and explore the theory of sampling in more depth. They also include presentations
of some results about survey sampling that may be of interest to more advanced 
students. Many of these exercises require students to know calculus and material 
from an undergraduate probability class. 


• Projects and Activities, new for the second edition, contain activities suitable
for classroom use or for assignment as projects. Many of these activities ask the 
student to design, collect, and analyze a sample selected from a population of 
real data provided on the book’s companion website. The activities continue from 
chapter to chapter, allowing students to build on their knowledge and compare 
various sampling designs. I always assign Exercise 31 from Chapter 7 and its 
continuation in subsequent chapters as a course project. This exercise asks students 
to download data from a survey on a topic of their choice from the Internet and 
analyze the data. Along the way, the students read and translate the survey design 
descriptions into the design features studied in class, develop skills in analyzing 
survey data, and gain experience in dealing with nonresponse and other challenges. 


You must know how to use a statistical computer package to be able to do the problems 
in this book. The second edition uses SAS software for computing estimates and 
graphing data, with selected output presented in the book and annotated code used 
for the examples available for download on the book’s companion website. Other 
software packages that calculate estimates for survey data can also be used with 
the book; the website www.hcp.med.harvard.edu/statistics/survey-soft/ provides an 
up-to-date overview of these programs. 

The book’s companion website, www.cengage.com/statistics/lohr, contains the 
SAS code and data sets referenced in the book, as well as the exercises and appendix 
from the first edition for the SURVEY computer program. Additionally, worked solu- 
tions to the exercises in the book are provided online to instructors who sign up for an 
account with Cengage’s Solution Builder service at www.cengage.com/solutionbuilder. 

Many people have been generous with their encouragement and suggestions for 
this book. I am grateful to J.N.K. Rao for his permission to adapt material that
he and I presented at the 2004 Joint Statistical Meetings for inclusion in the sec- 
ond edition, and for his suggestions and unfailing support. The following persons 
reviewed or used various versions of the manuscript, providing valuable suggestions 
for improvement: Elizabeth Stasny, Fritz Scheuren, Nancy Heckman, Ted Chang, 
Steve MacEachern, Mark Conaway, Ron Christensen, Michael Hamada, Partha Lahiri, 
Dale Everson, James Gentle, Ruth Mickey, Sarah Nusser, N.G.N. Prasad, Deborah 
Rumsey, Fritz Scheuren, David Bellhouse, David Marker, Tim Johnson, Stas 
Kolenikov, Serge Alalouf, Trent Buskirk, and Jae-kwang Kim. In addition, Anders 
Lundqvist, Imbi Traat, Andrew Gelman, Ron Christensen, Paul Biemer, Ron Fecso, 
Steve Fienberg, Pierre Lavallée, Mike Hidiroglou, Dave Chapman, Mike Brick, 
Thomas P. Ryan, Kinley Larntz, Shap Wolf, and Burke Grandjean provided much 
helpful advice and encouragement. Ted Chang first encouraged me to turn my class 
notes into a book, and generously allowed use of the SURVEY program. Alastair 
Scott’s inspiring class on sampling at the University of Wisconsin introduced me to 
the joys of the subject. 

Finally, thanks to my wonderful husband Doug for his patience and cheerful 
encouragement. 

Sharon L. Lohr 
Introduction


When statistics are not based on strictly accurate calculations, they mislead instead of guide. The mind 
easily lets itself be taken in by the false appearance of exactitude which statistics retain in their 
mistakes, and confidently adopts errors clothed in the form of mathematical truth. 


—Alexis de Tocqueville, Democracy in America 


1.1 A Sample Controversy


Shere Hite’s book Women and Love: A Cultural Revolution in Progress (1987) had a 
number of widely quoted results: 


• 84% of women are “not satisfied emotionally with their relationships” (p. 804).

• 70% of all women “married five or more years are having sex outside of their
marriages” (p. 856).

• 95% of women “report forms of emotional and psychological harassment from
men with whom they are in love relationships” (p. 810).

• 84% of women report forms of condescension from the men in their love
relationships (p. 809).


The book was widely criticized in newspaper and magazine articles throughout the 
United States. The Time magazine cover story “Back Off, Buddy” (October 12, 1987), 
for example, called the conclusions of Hite’s study “dubious” and “of limited value.” 

Why was Hite’s study so roundly criticized? Was it wrong for Hite to report the 
quotes from women who feel that the men in their lives refuse to treat them as equals, 
who perhaps have never been given the chance to speak out before? Was it wrong 
to report the percentages of these women who are unhappy in their relationships 
with men? 

Of course not. Hite’s research allowed women to discuss how they viewed their 
experiences, and reflected the richness of these women’s experience in a way that a 
multiple choice questionnaire could not. Hite erred in generalizing these results to
all women, whether they participated in the survey or not, and in claiming that the 
percentages above applied to all women. The following characteristics of the survey 
make it unsuitable for generalizing the results to all women. 


• The sample was self-selected; that is, recipients of questionnaires decided whether
they would be in the sample or not. Hite mailed 100,000 questionnaires; of these, 
4.5% were returned. 


• The questionnaires were mailed to such organizations as professional women’s
groups, counseling centers, church societies, and senior citizens’ centers. The 
members may differ in political views, but many have joined an “all-women” 
group, and their viewpoints may differ from other women in the United States. 


• The survey has 127 essay questions, and most of the questions have several parts.
Who will tend to return such a survey? 


• Many of the questions are vague, using words such as “love.” The concept of love
probably has as many interpretations as there are people, making it impossible to 
attach a single interpretation to any statistic purporting to state how many women 
are “in love.” Such question wording works well for eliciting the rich individual 
vignettes that comprise most of the book, but makes interpreting percentages 
difficult. 


• Many of the questions are leading: they suggest to the respondent which response
she should make. For instance: “Does your husband/lover see you as an equal? 
Or are there times when he seems to treat you as an inferior? Leave you out of 
the decisions? Act superior?” (p. 795) 


Hite writes “Does research that is not based on a probability or random sample 
give one the right to generalize from the results of the study to the population at 
large? If a study is large enough and the sample broad enough, and if one generalizes 
carefully, yes” (p. 778). Most survey statisticians would answer Hite’s question with 
a resounding “no.” In Hite’s survey, because the women sent questionnaires were 
purposefully chosen and an extremely small percentage of those women returned 
the questionnaires, statistics calculated from these data cannot be used to indicate 
attitudes of all women in the United States. The final sample is not representative of 
women in the United States, and the statistics can only be used to describe women 
who would have responded to the survey. 

Hite claims that results from the sample could be generalized because character- 
istics such as the age, educational, and occupational profiles of women in the sample 
matched those for the population of women in the United States. But the women in 
the sample differed on one important aspect—they were willing to take the time to fill 
out a long questionnaire dealing with harassment by men, and to provide intensely 
personal information to a researcher. We would expect that in every age group and 
socioeconomic class, women who choose to report such information would in gen- 
eral have had different experiences than women who choose not to participate in the 
survey. 
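
The effect of self-selection can be sketched with a small calculation. The code below is only an illustration under invented assumptions: the share of women with some trait (30%) and the two response rates (10% for women with the trait, 2.15% for women without it) are hypothetical values chosen so that the overall response rate is close to the 4.5% reported above. They are not estimates from Hite's data.

```python
import random


def expected_respondent_share(p_trait, resp_if_trait, resp_if_not):
    """Expected share of respondents who have the trait (Bayes' rule)."""
    with_trait = p_trait * resp_if_trait
    without_trait = (1 - p_trait) * resp_if_not
    return with_trait / (with_trait + without_trait)


def simulate_mailing(n_mailed, p_trait, resp_if_trait, resp_if_not, seed=0):
    """Simulate one mailing.

    Returns (overall response rate, share of respondents with the trait).
    """
    rng = random.Random(seed)
    respondents = 0
    respondents_with_trait = 0
    for _ in range(n_mailed):
        has_trait = rng.random() < p_trait
        p_respond = resp_if_trait if has_trait else resp_if_not
        if rng.random() < p_respond:
            respondents += 1
            respondents_with_trait += has_trait
    return respondents / n_mailed, respondents_with_trait / respondents


# Hypothetical inputs (assumptions for illustration only).
P_TRAIT = 0.30          # share of the mailed group that has the trait
RESP_IF_TRAIT = 0.10    # response rate among women with the trait
RESP_IF_NOT = 0.0215    # response rate among women without the trait

rate, share = simulate_mailing(100_000, P_TRAIT, RESP_IF_TRAIT, RESP_IF_NOT)
expected = expected_respondent_share(P_TRAIT, RESP_IF_TRAIT, RESP_IF_NOT)

print(f"overall response rate:              {rate:.3f}")
print(f"share with trait among respondents: {share:.2f}")
print(f"expected share among respondents:   {expected:.2f}")
print(f"share with trait in mailed group:   {P_TRAIT:.2f}")
```

Under these assumed rates, about two-thirds of the respondents have the trait even though only 30% of the women mailed a questionnaire do. Matching the respondents to the population on age or occupation would not remove this gap, because the decision to respond depends on the trait itself.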


1.2 Requirements of a Good Sample


In the movie “Magic Town,” the public opinion researcher played by James Stewart 
discovered a town that had exactly the same characteristics as the whole United States: 
Grandview had exactly the same proportion of people who voted Republican, the same 
proportion of people under the poverty line, the same proportion of auto mechanics, 
and so on, as the United States taken as a whole. All that Stewart’s character had to do 
was to interview the people of Grandview, and he would know what public opinion 
was in the United States. 

A perfect sample would be like Grandview: a “scaled-down” version of the pop- 
ulation, mirroring every characteristic of the whole population. Of course, no such 
perfect sample can exist for complicated populations (even if it did exist, we would 
not know it was a perfect sample without measuring the whole population). But a 
good sample will be representative in the sense that characteristics of interest in the 
population can be estimated from the sample with a known degree of accuracy. 

Some definitions are needed to make the notion of a good sample more precise. 


Observation unit: An object on which a measurement is taken. This is the basic
unit of observation, sometimes called an element. In studying human populations, 
observation units are often individuals. 


Target population: The complete collection of observations we want to study. Defining
the target population is an important and often difficult part of the study. For
example, in a political poll, should the target population be all adults eligible to vote? 
All registered voters? All persons who voted in the last election? The choice of target 
population will profoundly affect the statistics that result. 


Sample: A subset of a population.


Sampled population: The collection of all possible observation units that might have
been chosen in a sample; the population from which the sample was taken. 


Sampling unit: A unit that can be selected for a sample. We may want to study
individuals, but do not have a list of all individuals in the target population. Instead, 
households serve as the sampling units, and the observation units are the individuals 
living in the households. 


Sampling frame: A list, map, or other specification of sampling units in the population
from which a sample may be selected. For a telephone survey, the sampling frame 
might be a list of all residential telephone numbers in the city. For a survey using 
in-person interviews, the sampling frame might be a list of all street addresses. For 
an agricultural survey, a sampling frame might be a list of all farms, or a map of areas 
containing farms. 


In an ideal survey, the sampled population will be identical to the target population, 
but this ideal is rarely met exactly. In surveys of people, the sampled population is 
usually smaller than the target population: as illustrated in Figure 1.1, not all persons 
in the target population are included in the sampling frame, and a number of persons 
will not respond to the survey. 
FIGURE 1.1

Target population and sampled population in a telephone survey of likely voters. Not all
households have telephones, so a number of persons in the target population of likely voters will
not be associated with a telephone number in the sampling frame. In some households with
telephones, the residents are not registered to vote and hence are not eligible for the survey.
Some eligible persons in the sampling frame population do not respond because they cannot
be contacted, some refuse to respond to the survey, and some may be ill and incapable of
responding.

[Figure: diagram of the target population overlapping the sampling frame population. Labeled
regions: not included in sampling frame; not reachable; refuse to respond; not capable of
responding; sampled population; not eligible for survey.]


In the Hite (1987) study, one characteristic of interest was the percentage of women 
who are harassed in their relationship. An individual woman was an element. The 
target population was all adult women in the United States. Hite’s sampled population 
was women belonging to women’s organizations who would return the questionnaire. 
Consequently, inferences can only be made to the sampled population, not to the 
population of all adult women in the United States. 

The National Crime Victimization Survey (NCVS) is an ongoing survey to study 
victimization rates, administered by the U.S. Census Bureau and the Bureau of Jus- 
tice Statistics. If the characteristic of interest is the total number of households in the 
United States that were victimized by crime last year, the elements are households, 
the target population consists of all households in the United States, and the sam- 
pled population consists of households in the sampling frame, constructed from U.S. 
Census information and building permits, that are “at home” and agree to answer 
questions. 

The goal of the National Pesticide Survey, conducted by the U.S. Environmental 
Protection Agency, was to study pesticides and nitrate in drinking water wells nation- 
wide. The target population was all community water systems and rural domestic 
wells in the United States. The sampled population was all community water systems 
(all are listed in the Federal Reporting Data System) and all identifiable domestic wells 
outside of government reservations that belonged to households willing to cooperate 
with the survey. 

Public opinion polls are often taken to predict which candidate will win the next 
election. The target population is persons who will vote in the next election; the 
sampled population is often persons who can be reached by telephone and who are 
judged to be likely to vote in the next election. Few national polls in the 
United States include persons in hospitals, dormitories, or jails; they, and persons
without telephones, are not part of the sampling frame or of the sampled population. 


1.3 Selection Bias


A good sample will be as free from selection bias as possible. Selection bias occurs 
when some part of the target population is not in the sampled population, or, more 
generally, when some population units are sampled at a different rate than intended 
by the investigator. If a survey designed to study household income omits transient 
persons, the estimates from the survey of the average or median household income 
are likely to be too large. A sample of convenience is often biased, since the units that 
are easiest to select or that are most likely to respond are usually not representative of 
the harder-to-select or nonresponding units. The following examples indicate some 
ways in which selection bias can occur. 


• Using a sample selection procedure that, unknown to the investigators, depends
on some characteristic associated with the properties of interest. For example, 
investigators took a convenience sample of adolescents to study how frequently 
adolescents talk to their parents and teachers about AIDS. But adolescents willing 
to talk to the investigators about AIDS are probably also more likely to talk to other 
authority figures about AIDS. The investigators, who simply averaged the amounts 
of time that adolescents in the sample said they spent talking with their parents 
and teachers, probably overestimated the amount of communication occurring 
between parents and adolescents in the population. 


• Deliberately or purposively selecting a “representative” sample. If we want to
estimate the average amount a shopper spends at the Mall of America in a shopping 
trip, and we sample shoppers who look like they have spent an “average” amount, 
we have deliberately selected a sample to confirm our prior opinion. This type of 
sample is sometimes called a judgment sample—the investigator uses his or her 
judgment to select the specific units to be included in the sample. 


• Misspecifying the target population. For instance, all the polls in the 1994 Democratic
gubernatorial primary election in Arizona predicted that candidate Eddie
Basha would trail the front-runner in the polls by at least nine percentage points. 
In the election, Basha won 37% of the vote; the other two candidates won 35% 
and 28%, respectively. One problem is that many voters were undecided at the 
time the polls were taken. Another is that the target population for the polls was 
registered voters who had voted in previous primary elections and were interested 
in this one. In the primary election, however, Basha had heavy support in rural 
areas from demographic groups that had not voted before and hence were not 
targeted in the surveys. 


• Failing to include all of the target population in the sampling frame, called
undercoverage. The U.S. Behavioral Risk Factor Surveillance System survey, described
at www.cdc.gov, illustrates some of the coverage problems that may occur in a
household telephone survey. The target population for this survey on preventive
health practices and risk behaviors is adults aged 18 and older in the United
States. Some undercoverage occurs because persons in institutions such as nurs- 
ing homes or prisons are excluded. Additional undercoverage occurs because the 
survey is conducted by telephone. Some households do not have telephones and 
telephone coverage varies across states. Households in the southern part of the 
United States, minority households, and low-income households are less likely to 
have telephones, so those households are likely to be underrepresented in the sam- 
ple because of the undercoverage. Households that have only a cellular telephone 
are also not included in the sampling frame at this writing. 


• Including population units in the sampling frame that are not in the target
population, called overcoverage. Overcoverage can occur when persons not in the
target population are not screened out of the sample, or when data collectors are 
not given specific instructions on sample eligibility. The target population for a 
telephone survey on radio listening habits might be persons aged 18 and over, but 
some interviewers might include persons under age 18 when taking the sample, 
and children and teenagers may well listen to different radio stations than adults. 


• Having multiplicity of listings in the sampling frame, without adjusting for the
multiplicity in the analysis. In its simplest form, random digit dialing prescribes
selecting a random sample of 10-digit numbers. Households with more than one
telephone line then have a higher chance of being selected in the sample. This
multiplicity can be compensated for in the estimation (we’ll discuss this in Section 6.5);
if it is ignored, bias can result. One might expect households with more telephone 
lines to be larger or more affluent, so if no adjustment is made for those house- 
holds having a higher probability of being selected for the sample, estimates of 
average income or household size may be too large. 


• Substituting a convenient member of a population for a designated member who
is not readily available. For example, if no one is at home in the designated 
household, a field representative might try next door. In a wildlife survey, the 
investigator might substitute an area next to a road for a less accessible area. In 
each case, the sampled units most likely differ on a number of characteristics from 
units not in the sample. The substituted household may be more likely to have 
a member who does not work outside of the house than the originally selected 
household. The area by the road may have fewer frogs than the area that is harder 
to reach. 


• Failing to obtain responses from all of the chosen sample. Nonresponse distorts
the results of many surveys, even surveys that are carefully designed to minimize 
other sources of selection bias. Often, nonrespondents differ critically from the 
respondents, but the extent of that difference is unknown unless you can later 
obtain information about the nonrespondents. Many surveys reported in news- 
papers or research journals have dismal response rates—in some, the response 
rate is as low as 10%. It is difficult to see how results can be generalized to the 
population when 90% of the targeted sample cannot be reached or refuses to 
participate. 

The Adolescent Health Database Survey was designed to obtain a representa- 
tive sample of Minnesota junior and senior high school students in public schools 
(Remafedi et al., 1992). Overall, 49% of the school districts that were invited to 
participate in the survey agreed to participate. The response rate varied with the
size of the school district: 


Type of School District                          Participation Rate (%)
Urban                                            100
Metropolitan suburban                            25
Nonmetropolitan with more than 2000 students     62
Nonmetropolitan with 1000-1999 students          27
Nonmetropolitan with 500-999 students            61
Nonmetropolitan with fewer than 500 students     53


In each of the school districts that participated, surveys were distributed to stu- 
dents and students’ participation was voluntary. Of the 52,553 surveys distributed 
to students, 36,741 were completed and returned, resulting in a student response 
rate of 69%. The survey asked questions about health habits, religious affiliation, 
psychosocial status, and sexual orientation. It seems likely that responding and 
nonresponding school districts have different levels of health and activity. It seems 
even more likely that students who respond to the survey will, on average, have a 
different health profile than students who do not respond to the survey. 

Many studies comparing respondents and nonrespondents have found differences
in the two groups. In the Iowa Women’s Health Study, 41,836 women
responded to a mailed questionnaire in 1986. Bisgard et al. (1994) compared those 
respondents to the 55,323 nonrespondents by checking records in the State Health 
Registry; they found that the age-adjusted mortality rate and the cancer attack rate 
were significantly higher for the nonrespondents than for the respondents. 


• Allowing the sample to consist entirely of volunteers. Such is the case in radio
and television call-in polls, and in most online surveys. The statistics from such 
surveys cannot be trusted. At best, they are entertainment; at worst, they mislead, 
particularly when statistics from polls with self-selected respondents are cited 
in policy debates without any mention of their unscientific nature. CNN.com’s 
daily QuickVote, which invites site visitors to vote on an issue of the day, care- 
fully states that “This QuickVote is not scientific and reflects the opinions of 
only those Internet users who have chosen to participate. The results cannot be 
assumed to represent the opinions of Internet users in general, nor the public as 
a whole” (Cable News Network, 2002). Yet statistics from QuickVote and other
online surveys are frequently quoted by independent research institutes, policy 
organizations, and scholarly journals. For example, Christian and Kinney (1999) 
cited a 1999 Internet poll on CNN.com, where 98% of the 17,000 visitors to a 
website linked to the science and technology reports voted “yes” to a question on 
whether the Hubble Space Telescope was worth the investment, as an indication 
of “a great improvement in public opinion.” In fact, all that can be concluded from 
the Internet poll is that nearly 17,000 people who visited a website voted “yes” 
on the question; nothing can be inferred about the rest of the population with- 
out making heroic assumptions. Some individuals or organizations may respond 
multiple times to a voluntary survey, and a determined organization may skew the 
results. 
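
The multiplicity problem described in the list above lends itself to a short numerical sketch. The code below assumes a hypothetical population of 1,000 households in which households with more telephone lines tend to be larger, draws households with probability proportional to their number of lines (a simplified stand-in for random digit dialing, sampling with replacement), and compares the unweighted sample mean of household size with a mean that weights each household by the inverse of its number of lines. The population counts and function names are invented for illustration, and inverse-multiplicity weighting is shown as one common adjustment, not necessarily the exact method developed in Section 6.5.

```python
import random

# Hypothetical population: (telephone lines, household size, number of households).
POPULATION = [(1, 2, 600), (1, 3, 200), (2, 4, 150), (3, 5, 50)]


def true_mean_size(population):
    """Mean household size over every household in the population."""
    total_size = sum(size * count for _, size, count in population)
    total_households = sum(count for _, _, count in population)
    return total_size / total_households


def dial_sample(population, n, seed=0):
    """Draw n households with probability proportional to their number of lines."""
    rng = random.Random(seed)
    groups = [(lines, size) for lines, size, _ in population]
    # A group's chance of being dialed is proportional to its total number of lines.
    weights = [lines * count for lines, _, count in population]
    return rng.choices(groups, weights=weights, k=n)


def unweighted_mean(sample):
    """Plain average of household size, ignoring multiplicity."""
    return sum(size for _, size in sample) / len(sample)


def weighted_mean(sample):
    """Average of household size with weight 1 / (number of lines)."""
    numerator = sum(size / lines for lines, size in sample)
    denominator = sum(1 / lines for lines, _ in sample)
    return numerator / denominator


sample = dial_sample(POPULATION, n=20_000)
print(f"true mean household size: {true_mean_size(POPULATION):.2f}")
print(f"unweighted sample mean:   {unweighted_mean(sample):.2f}")
print(f"weighted sample mean:     {weighted_mean(sample):.2f}")
```

In this made-up population the true mean household size is 2.65, while the unweighted mean under line-proportional selection has expected value 3.0. The weighted mean should land close to 2.65, because weighting by the inverse of the number of lines offsets the higher selection probability of multi-line households.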


EXAMPLE 1.1 




Many surveys have more than one of these problems. The Literary Digest (1932, 
1936a, b, c) began taking polls to forecast the outcome of the U.S. presidential election 
in 1912, and their polls attained a reputation for accuracy because they forecast the 
correct winner in every election between 1912 and 1932. In 1932, for example, the 
poll predicted that Roosevelt would receive 56% of the popular vote and 474 votes in 
the Electoral College; in the actual election, Roosevelt received 58% of the popular 
vote and 472 votes in the Electoral College. 

With such a strong record of accuracy, it is not surprising that the editors of The 
Literary Digest had a great deal of confidence in their polling methods by 1936. 
Launching the 1936 poll, they said: 


The Poll represents thirty years’ constant evolution and perfection. Based on the “com- 
mercial sampling” methods used for more than a century by publishing houses to push 
book sales, the present mailing list is drawn from every telephone book in the United 
States, from the rosters of clubs and associations, from city directories, lists of regis- 
tered voters, classified mail-order and occupational data. (1936a, p. 3) 


On October 31, the poll predicted that Republican Alf Landon would receive 
55% of the popular vote, compared with 41% for President Roosevelt. The article 
“Landon, 1,293,669; Roosevelt, 972,897: Final Returns in The Digest’s Poll of Ten 
Million Voters” contained the statement “We make no claim to infallibility. We did 
not coin the phrase ‘uncanny accuracy’ which has been so freely applied to our Polls” 
(1936b). It is a good thing they made no claim to infallibility: In the election, Roosevelt 
received 61% of the vote; Landon, 37%. 

What went wrong? One problem may have been undercoverage in the sampling 
frame, which relied heavily on telephone directories and automobile registration 
lists—the frame was used for advertising purposes, as well as for the poll. House- 
holds with a telephone or automobile in 1936 were generally more affluent than other 
households, and opinion of Roosevelt’s economic policies was generally related to 
the economic class of the respondent. But sampling frame bias does not explain all the 
discrepancy. Postmortem analyses of the poll by Squire (1988) and Cahalan (1989)
indicate that even persons with both a car and a telephone tended to favor Roosevelt, 
though not to the degree that persons with neither car nor telephone supported him. 

The low response rate to the survey was likely the source of much of the error. Ten 
million questionnaires were mailed out, and 2.3 million were returned—an enormous 
sample, but a response rate of less than 25%. In Allentown, Pennsylvania, for example, 
the survey was mailed to every registered voter, but the survey results for Allentown 
were still incorrect because only one-third of the ballots were returned. Squire (1988) 
reports that persons supporting Landon were much more likely to have returned the 
survey; in fact, many Roosevelt supporters did not even remember receiving a survey 
even though they were on the mailing list. 

One lesson to be learned from The Literary Digest poll is that the sheer size 
of a sample is no guarantee of its accuracy. The Digest editors became complacent 
because they sent out questionnaires to more than one quarter of all registered voters 
and obtained a huge sample of 2.3 million people. But large unrepresentative samples 
can perform as badly as small unrepresentative samples. A large unrepresentative 
sample may do more damage than a small one because many people think that large
samples are always better than small ones. The design of the survey is far more
important than the absolute size of the sample. ■
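
The point about sample size can be illustrated with a small simulation. The sketch below is not a reconstruction of the Digest poll: the response probabilities (0.15 for supporters of the incumbent, 0.35 for opponents), the sample sizes, and the function names are assumptions chosen only for illustration. Only the true support of 61% is taken from the election result quoted above.

```python
import random

def biased_mail_poll(n_mailed, true_support, p_resp_supporter, p_resp_opponent, rng):
    """Mail n_mailed questionnaires; whether a person responds depends on opinion.

    Returns (number of respondents, estimated support among respondents).
    """
    yes = no = 0
    for _ in range(n_mailed):
        supporter = rng.random() < true_support
        p_resp = p_resp_supporter if supporter else p_resp_opponent
        if rng.random() < p_resp:
            if supporter:
                yes += 1
            else:
                no += 1
    return yes + no, yes / (yes + no)

def small_random_sample(n, true_support, rng):
    """Estimated support from a simple random sample in which everyone responds."""
    yes = sum(rng.random() < true_support for _ in range(n))
    return yes / n

rng = random.Random(1936)          # fixed seed so the run is reproducible
TRUE_SUPPORT = 0.61                # actual share of the vote, from the text

# Hypothetical response probabilities: opponents are more eager to reply.
n_resp, big_est = biased_mail_poll(200_000, TRUE_SUPPORT, 0.15, 0.35, rng)
small_est = small_random_sample(2_000, TRUE_SUPPORT, rng)

print(f"large self-selected poll: {n_resp} respondents, estimate {big_est:.3f}")
print(f"small random sample:      2000 respondents, estimate {small_est:.3f}")
```

Under these assumed response probabilities, the large poll has expected estimate 0.61(0.15) / [0.61(0.15) + 0.39(0.35)], or about 0.40, no matter how many questionnaires are mailed. Mailing more questionnaires shrinks the sampling variability around the wrong value; it does nothing to move that value toward 0.61.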


What good are samples with selection bias? We prefer to have samples with no selection
bias, that serve as a microcosm of the population. When the primary interest is in 
estimating the total number of victims of violent crime in the United States, or the 
percentage of likely voters in the United Kingdom who intend to vote for the Labour 
Party in the next election, serious selection bias can cause the sample estimates to be 
invalid. 

Purposive or judgment samples can provide valuable information, though, partic- 
ularly in the early stages of an investigation. Teichman et al. (1993) took soil samples 
along Interstate 880 in Alameda County, California, to determine the amount of lead 
in yards of homes and in parks close to the freeway. In taking the samples, they con- 
centrated on areas where they thought children were likely to play and areas where 
soil might easily be tracked into homes. The purposive sampling scheme worked well 
for justifying the conclusion of the study, that “lead contamination of urban soil in 
the east bay area of the San Francisco metropolitan area is high and exceeds haz- 
ardous waste levels at many sites.” A sampling scheme that avoided selection bias 
would be needed for this study if the investigators wanted to generalize the estimated 
percentage of contaminated sites to the entire area. 


1.4 Measurement Error


A good sample has accurate responses to the items of interest. When a response in the 
survey differs from the true value, measurement error has occurred. Measurement 
bias occurs when the response has a tendency to differ from the true value in one 
direction. As with selection bias, measurement error and bias must be considered and 
minimized in the design stage of the survey; no amount of statistical analysis will 
disclose that the scale erroneously added 5 kilograms to the weight of every person 
in the health survey. 

Measurement error is a concern in all surveys and can be insidious. In many sur- 
veys of vegetation, for example, areas to be sampled are divided into smaller plots. 
A sample of plots is selected, and the number of plants in each plot is recorded. 
When a plant is near the boundary of the region, the field researcher needs to decide 
whether to include the plant in the tally. A person who includes all plants near or 
on the boundary in the count is likely to produce an estimate of the total number of 
plants in the area that is too high because some plants may be counted twice. Duce 
et al. (1972) report concentrations of trace metals, lipids, and chlorinated hydrocar- 
bons in the top 100 micrometers of Narragansett Bay that are 1.5 to 50 times as great 
as those in the water 20 cm below the surface. In a study of the transport of pollutants
from coastal waters to the deeper waters of the ocean, a sampling scheme that ignores
this boundary effect may underestimate the amount transported.

Sometimes measurement bias is unavoidable. In the North American Breeding 
Bird Survey, observers stop every one-half mile on designated routes and count all 
birds heard or seen during a 3-minute period within a quarter-mile radius
(Sauer et al., 1997). The count of birds for that point is almost always an underestimate
of the number of birds in the area; statistical models may possibly be used
to adjust for the measurement bias. If data are collected with the same procedure and 
with similarly skilled observers from year to year, the survey can be used to esti- 
mate trends in the population of different species—the biases from different years are 
expected to be similar, and may cancel when year-to-year differences are calculated. 
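
A minimal numerical sketch shows why a consistent undercount need not distort trend estimates. The bird counts and the detection fraction below are hypothetical, and the assumption that observers detect the same fraction of birds every year is a deliberate simplification; the function name is ours.

```python
import math

# Hypothetical true numbers of birds near one stop in three successive years.
true_counts = {1995: 120, 1996: 132, 1997: 150}

# Assumption for illustration: observers detect the same fraction every year.
DETECTION_FRACTION = 0.6

# Observed counts are biased downward in every year.
observed = {year: DETECTION_FRACTION * n for year, n in true_counts.items()}

def relative_change(counts, year_from, year_to):
    """Proportional change in the count between two years."""
    return (counts[year_to] - counts[year_from]) / counts[year_from]

for year in (1996, 1997):
    true_trend = relative_change(true_counts, year - 1, year)
    observed_trend = relative_change(observed, year - 1, year)
    same = math.isclose(true_trend, observed_trend)
    print(year, f"true {true_trend:.3f}", f"observed {observed_trend:.3f}", f"equal: {same}")
```

Each observed count understates the true count, yet the year-to-year relative change is the same in both series because the constant detection fraction cancels in the ratio. If detection varied from year to year, for instance because observers differed in skill, the cancellation would no longer hold.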

Obtaining accurate responses is challenging in all types of surveys, but particularly 
so in surveys of people: 


■ People sometimes do not tell the truth. In an agricultural survey, farmers in an
area with food-aid programs may underreport crop yields, hoping for more food 
aid. Obtaining truthful responses is a particular challenge in surveys involving 
sensitive subject matter, such as surveys about drug use. 


■ People do not always understand the questions. Many persons in the United States
were shocked by the results of a 1993 Roper poll reporting that 25% of Americans 
did not believe the Holocaust really happened. When the double-negative structure 
of the question was eliminated, and the question reworded, only 1% thought it 
was “possible ... the Nazi extermination of the Jews never happened.” 


■ People forget. One problem faced in the design of the NCVS is that of telescoping:
Persons are asked about experiences as a crime victim in the last six months, but 
some include victimizations that occurred more than six months ago. 


■ People give different answers to different interviewers. Schuman and Converse
(1971) employed white and black interviewers to interview black residents of 
Detroit. In response to the question “Do you personally feel that you can trust 
most white people, some white people, or none at all?” 35% of the respondents 
interviewed by a white person said they could trust most white people. The per- 
centage was 7% for those interviewed by a black person. 


■ People may say what they think an interviewer wants to hear or what they think will
impress the interviewer. In experiments done with questions beginning “Do you 
agree or disagree with the following statement” it has been found that a subset of 
the population tends to agree with any statement regardless of its content. Lenski 
and Leggett (1960) found that about one-tenth of their sample agreed with both 
of the following statements: 


It is hardly fair to bring children into the world, the way things look for the future. 
Children born today have a wonderful future to look forward to. 


Some responses are perceived as being more socially desirable than others, so 
that persons may overreport behaviors such as exercising and donating to charities, 
and underreport behaviors such as smoking or drinking. 


■ A particular interviewer may affect the accuracy of the response, by misreading
questions, recording responses inaccurately, or antagonizing the respondent. In 
a survey on abortion, a poorly trained interviewer with strong feelings about 
abortion may encourage the respondent to provide one answer rather than another. 
In extreme cases, an interviewer may change the answers given by the respondent, 
or simply make up data and not contact the respondent at all.


■ Certain words mean different things to different people. A simple question such
as “Do you own a car?” may be answered yes or no depending on the respondent’s 
interpretation of “you” (does it refer to just the individual, or to the household?), 
“own” (does it count as ownership if you are making payments to a finance 
company?), or “car” (are pickup trucks included?). 


■ Question wording and question order have a large effect on the responses obtained.
Two surveys were taken in late 1993/early 1994 about Elvis Presley. One survey 
asked, “In the past few years, there have been a lot of rumors and stories about 
whether Elvis Presley is really dead. How do you feel about this? Do you think 
there is any possibility that these rumors are true and that Elvis Presley is still 
alive, or don’t you think so?” The other survey asked, “A recent television show
examined various theories about Elvis Presley’s death. Do you think it is possible 
that Elvis is alive or not?” Eight percent of the respondents to the first question 
said it is possible that Elvis is still alive; 16% of respondents to the second question 
said it is possible that Elvis is still alive. 


Excellent discussions of these problems can be found in Groves et al. (2009) 
and Tourangeau et al. (2000). In some cases, accuracy can be increased by careful 
questionnaire design. 


1.5 Questionnaire Design


This section gives a very brief introduction to writing and testing questions. It provides 
some general guidelines and examples, but if you are writing a questionnaire, you 
should consult one of the more comprehensive references on questionnaire design 
listed at the end of this chapter. 

The most important step in writing a questionnaire is to decide what you want 
to find out. Write down the goals of your survey, and be precise. “I want to learn 
something about the homeless” won’t do. Instead, you should write down specific 
questions, such as “What percentage of persons using homeless shelters in Chicago 
between January and March 1996 are under 16 years old?” Then, write or select 
questions that will elicit accurate answers to the research questions, and that will 
encourage persons in the sample to respond to the questions. 


Always test your questions before taking the survey. Ideally, the questions would
be tested on a small sample of members of the target population. Try different 
versions for the questions, and ask respondents in your pretest how they interpret 
the questions. 

The NCVS was tested for several years before it was conducted on a national 
scale (Lehnen and Skogan, 1981). The pretests were used to help decide on a 
recall period (it was decided to ask respondents about victimizations that had 
occurred in the previous six months), test interviewing procedures and questions, 
and compare information from selected interviews with information found in the 
police report about the victimization. As a result of the pretests, some of the long 
and repetitious questions were shortened and more specific wording introduced.
The questionnaire was revised in 1985 and again in 1991 to make use of
recent research in cognitive psychology and to include topics, such as victim and 
bystander behavior, that were not found in the earlier versions. All revisions are 
tested extensively in the field before being used (Taylor, 1989). In the past, for 
example, the NCVS has been criticized for underreporting the crime of rape; when 
the questionnaire was designed in the early 1970s, there was worry that asking 
about rape directly would be perceived as insensitive and embarrassing, and would 
provoke congressional outrage. The original NCVS questionnaire asked a series of 
specific questions intended to prompt the memory of respondents. These included 
questions such as “Did anyone take something directly from you by using force, 
such as by a stickup, mugging or threat?” The last question in the violent crime 
screening section of the questionnaire was “Did anyone try to attack you in some 
other way?” If the respondent mentioned in response that he or she was raped, 
then a rape was reported. Not surprisingly, the victimization rate for rape reported 
for the 1990 and earlier NCVS is very low: It is reported that about 1 per 1000 
females aged 12 and older were raped in 1990. The current version of the NCVS 
questionnaire asks about rape directly. 

You will not necessarily catch misinterpretations of questions by trying them 
out on friends or colleagues; your friends and colleagues may have backgrounds 
similar to yours, and may not have the same understanding of words as persons 
in your target population. Belson (1981) demonstrated that each of 29 questions 
about television viewing was misinterpreted by some respondents. The question 
“Do you think that the television news programmes are impartial about politics?” 
was tested on 56 people. Of these, 13 interpreted the question as intended, 18 
respondents narrowed the term news programmes to mean “news bulletins,” 21 
narrowed it to “political programmes,” and 1 interpreted it as “newspapers.” Only
25 persons interpreted “impartial” as intended; 5 inferred the opposite meaning, 
“partial”; 11, as “giving too much or too little attention to”; and the others were 
simply unfamiliar with the word. Suessbrick et al. (2000) found that the concepts 
in a seemingly clear question such as “Have you smoked at least 100 cigarettes in 
your entire life?” were commonly interpreted in a different way than the authors
intended: Some respondents included marijuana cigarettes or cigars, while others 
excluded cigarettes that were only partially smoked or hand-rolled cigarettes. 


Keep it simple and clear. Questions that seem clear to you may not be clear to 
someone listening to the whole question over the telephone, or to a person with 
a different native language. Belson (1981, p. 240) tested the question “What pro- 
portion of your evening viewing time do you spend watching news programmes?” 
on 53 people. Only 14 people correctly interpreted the word “proportion” as “percentage,”
“part,” or “fraction.” Others interpreted it as “how long do you watch”
or “which news programmes do you watch.”


Use specific questions instead of general ones, if possible. Strunk and White 
advised writers to “Prefer the specific to the general, the definite to the vague, the 
concrete to the abstract” (1959, p. 15). Good questions result from good writing. 

Instead of asking “Did anyone attack you in the last six months,” the NCVS 
asks a series of specific questions detailing how one might be attacked. The NCVS 
question is “Has anyone attacked or threatened you in any of these ways: (a) With
any weapon, for instance, a gun or knife, (b) With anything like a baseball bat,
frying pan, scissors, or stick ....” 


Relate your questions to the concept of interest. This seems obvious but is forgotten 
or ignored in many surveys. In some disciplines, a standard set of questions has 
been developed and tested, and these are then used by subsequent researchers. 
Often, use of a common survey instrument allows results from different studies 
to be compared. In some cases, however, the standard questions are inappropriate 
for addressing the research hypotheses. 

Pincus (1993) criticizes early research that concluded that persons with arthri- 
tis were more likely to have psychological problems than persons without arthri- 
tis. In those studies, persons with arthritis were given the Minnesota Multiphasic 
Personality Inventory, a test of 566 true/false questions commonly used in psy- 
chological research. Patients with rheumatoid arthritis tended to have high scores 
on the scales of hypochondriasis, depression, and hysteria. Part of the reason they 
scored highly on those scales is clear when the actual questions are examined. A 
person with arthritis can truthfully answer false to questions such as “I am about 
as able to work as I ever was,” “I am in just as good physical health as most of 
my friends,” and “I have few or no pains” without being either hysterical or a
hypochondriac. 


Decide whether to use open or closed questions. An open question allows respon- 
dents to form their own response categories; in a closed question (multiple 
choice), the respondent chooses from a set of categories read or displayed. Each 
has advantages. A closed question may prompt the respondent to remember 
responses that might otherwise be forgotten, and is in accordance with the prin- 
ciple that specific questions are better than general ones. If the subject matter has 
been thoroughly pretested and responses of interest are known, a well-written 
closed question will usually elicit more accurate responses, as in the NCVS ques- 
tion “Has anyone attacked or threatened you with anything like a baseball bat, 
frying pan, scissors, or stick?” If the survey is exploratory or questions are sen- 
sitive, though, it is often better to use an open question: Bradburn and Sudman 
(1979) note that respondents reported higher frequency of drinking alcoholic bev- 
erages when asked an open question than a closed question with categories “never” 
through “daily.” 

Schuman and Scott (1987) conclude that, depending on the context, either open 
or closed questions can limit the types of responses received. In one experiment, 
the most common responses to the open question “What do you think is the 
most important problem facing this country today?” were “unemployment” (17%) 
and “general economic problems” (17%). The closed version asked, “Which of 
the following do you think is the most important problem facing this country 
today—the energy shortage, the quality of public schools, legalized abortion, or 
pollution—or if you prefer, you may name a different problem as most important”; 
32% of respondents chose “the quality of public schools.” In this case, the limited
options in the closed question guided respondents to one of the listed responses. 
In another experiment, Schuman and Scott (1987) asked respondents to name one 
or two of the most important national or world events or changes during the last
50 years. Persons asked the open question most frequently gave responses such
as World War II or the Vietnam War; they typically did not mention events such
as the invention of the computer, which was the most prevalent response to the 
closed question including this option. 

If using a closed question, always have an “other” category. In one study of 
sexual activity among adolescents, adolescents were asked from whom they felt 
the most pressure to have sex. Categories for the closed question were “friends of 
same sex,” “boyfriend/girlfriend,” “friends of opposite sex,” “TV or radio,” “don’t 
feel pressure,” and “other.” The response “parents” or “father” was written in by 
a number of the adolescent respondents, a response that had not been anticipated 
by the researchers. 


Report the actual question asked. Public opinion is complex, and you inevitably 
leave a distorted impression of it when you compress the results of your careful 
research into a summary statement “x% of Americans favor affirmative action.” 

The results of three surveys in Spring 1995, all purportedly about affirmative 
action, emphasize the importance of reporting the question. A Newsweek poll 
asked “Should there be special consideration for each of the following groups 
to increase their opportunities for getting into college and getting jobs or pro- 
motions?” and asked about these groups: blacks, women, Hispanics, Asians, and 
Native Americans. The poll found that 62% of blacks but only 25% of whites 
answered “yes” to the question about blacks. A USA Today-CNN-Gallup poll 
asked the question “What is your opinion on affirmative action programs for 
women and minorities: do you favor them or oppose them?” and reported that
55% of respondents favored such programs. A Harris poll asking “Would you 
favor or oppose a law limiting affirmative action programs in your state?” reported 
51% of respondents favoring such a law. These questions are clearly addressing 
different concepts because the differences in percentages obtained are too great 
to be ascribed to the different samples of people taken by the three organizations. 
Yet all three polls’ results were described in newspapers in terms of percentages 
of persons who support affirmative action. 


Avoid questions that prompt or motivate the respondent to say what you would 
like to hear. These are often called leading, or loaded, questions. The May 17, 
1994 issue of The Wall Street Journal reported the following question asked by the 
Gallup Organization in a survey commissioned by the American Paper Institute: 
“It is estimated that disposable diapers account for less than 2 percent of the trash 
in today’s landfills. In contrast, beverage containers, third-class mail and yard 
waste are estimated to account for about 21 percent of trash in landfills. Given 
this, in your opinion, would it be fair to tax or ban disposable diapers?” 


Consider the social desirability of responses to questions, and write questions 
that elicit honest responses. Abelson et al. (1992) review several studies that find 
many people say they voted in the last election when they actually did not vote. 
They argue that voting is a socially desirable behavior, and many respondents do 
not want to admit that they did not vote; respondents need to be prompted to report 
their actual behavior. 


Avoid double negatives. Double negatives needlessly confuse the respondent. A 
question such as “Do you favor or oppose not allowing drivers to use cell phones
while driving?” might elicit either “favor” or “oppose” from a respondent who
thinks persons should not use cell phones while driving. 


Use forced-choice, rather than agree/disagree questions. As noted earlier, some 
persons will agree with almost any statement. Schuman and Presser (1981, p. 223) 
report the following differences from an experiment comparing agree/disagree 
with forced-choice versions: 

Q1: Do you agree or disagree with this statement: Most men are better suited 
emotionally for politics than are most women. 

Q2: Would you say that most men are better suited emotionally for politics 
than are most women, that men and women are equally suited, or that women are 
better suited than men in this area? 


                                     Years of schooling
                                     0-11     12     13+

Q1: Percent “agree”                   57      44      39
Q2: Percent “men better suited”       33      38      28


Ask only one concept per question. In particular, avoid what are sometimes called 
double-barreled questions, so named because if one barrel of the shotgun does 
not get you, the other one will. 

The question “Do you agree with Bill Clinton’s $50 billion bailout of Mexico?” 
appeared on a survey distributed by a member of the U.S. House of Represen- 
tatives to his constituents. The question is really confusing two opinions of the 
respondent: the opinion of Bill Clinton, and the opinion of the Mexico policy. 
Disapproval of either one will lead to a “disagree” answer to the question. Note 
also the loaded content of the word bailout, which will almost certainly elicit 
more negative responses than the term aid package would. 


Pay attention to question order effects. If you ask more than one question on a 
topic, it is usually (but not always) better to ask the more general question first and
follow it by the specific questions. McFarland (1981) conducted an experiment in 
which half of the respondents were given general questions (for example, “How 
interested would you say you are in religion: very interested, somewhat interested, 
or not very interested?”) first, followed by specific questions on the subject (“Did
you, yourself, happen to attend church in the last seven days?”); the other half
were asked the specific questions first and then asked the general questions. When
the general question was asked first, 56% reported that they were “very interested
in religion”; the percentage rose to 64% when the specific question was asked 
first. 

Serdula et al. (1995) found that in the years in which a respondent of a health 
survey was asked to report his or her weight and then immediately asked “Are 
you trying to lose weight?” 28.8% of men and 48.0% of women reported that they 
were trying to lose weight. When “Are you trying to lose weight?” was asked in 
the middle of the survey and the self-report question on weight at the end of the 
survey, 26.5% of the men and 40.9% of the women reported that they were trying 
to lose weight. The authors speculate that respondents who are reminded of their 
weight status may overreport trying to lose weight.

The 2000 U.S. Census had separate questions for race (with categories white,
black, American Indian or Alaskan Native, Asian Indian, Chinese, and 
Filipino, among others) and ethnicity (with categories non-Hispanic, Mexican- 
American, Puerto Rican, Cuban, and other Hispanic). These are considered sep- 
arate classifications; an individual can, for instance, be white Cuban or black 
non-Hispanic. The Census Bureau has done a great deal of experimental research 
to determine effects of alternate wordings and orderings of these questions on 
responses (Bates et al., 1995). Martin et al. (2005) report results of experiments 
comparing the questions used for the 1990 Census with those used for the 2000 
Census. In 1990, race was question 4 and ethnicity was question 7; in 2000, ethnic- 
ity was question 7 and race was question 8. When the question on race occurred 
first, as in the 1990 Census, some Hispanic respondents looked for a Hispanic 
category, did not find it, and checked the “Other Race” category. After answering 
the race question, some persons skipped the ethnicity question so that there was 
substantial nonresponse on the ethnicity question. The reversed question order 
and other changes in the 2000 Census led to less missing data on both the race 
and ethnicity questions. 


1.6 Sampling and Nonsampling Errors


Most opinion polls that you see report a “margin of error.” Many merely say that the 
margin of error is 3 percentage points. Others give more detail, as in this excerpt from 
a New York Times poll: “In theory, in 19 cases out of 20 the results based on such 
samples will differ by no more than three percentage points in either direction from 
what would have been obtained by interviewing all Americans.” The margin of error 
given in polls is an expression of sampling error, the error that results from taking 
one sample instead of examining the whole population. If we took a different sample, 
we would most likely obtain a different sample percentage of persons who visited 
the public library last week. Sampling errors are usually reported in probabilistic 
terms. We discuss the calculation of sampling errors for different survey designs in 
Chapters 2 through 7. 
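
For readers who want to see where a figure such as three percentage points can come from, here is a rough sketch. It assumes a simple random sample from a large population, complete response, and the usual normal approximation, with 1.96 as the multiplier corresponding to "19 cases out of 20"; the function names are ours, and the worst-case proportion 0.5 is used as the default. Actual polls use more complex designs, so this should be read as an approximation only.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes a large population, complete response, and the normal
    approximation; z=1.96 corresponds to "19 cases out of 20" (95%).
    """
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for_margin(moe, p=0.5, z=1.96):
    """Smallest n whose approximate margin of error is at most moe."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

n_needed = sample_size_for_margin(0.03)
print("respondents needed for a 3-point margin:", n_needed)
print("margin of error at that sample size:", round(margin_of_error(n_needed), 4))
```

Under these assumptions, a margin of error of three percentage points corresponds to a sample of a little over 1,000 respondents. The calculation says nothing about selection bias or measurement error, which is the subject of the paragraphs that follow.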

Selection bias and measurement error are examples of nonsampling errors, 
which are any errors that cannot be attributed to the sample-to-sample variability. 
In many surveys, the sampling error that is reported for the survey may be negligi- 
ble compared to the nonsampling errors; you often see surveys with a 30% response 
rate proudly proclaiming their 3% margin of error, while ignoring the tremendous 
selection bias in their results. 
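
A simple worst-case calculation makes the contrast concrete. The sketch below assumes nothing at all about the nonrespondents: they might all answer "no" or all answer "yes." The function name and the 50% figure among respondents are hypothetical; the 30% response rate is the one mentioned above.

```python
def nonresponse_bounds(p_respondents, response_rate):
    """Worst-case range for a population proportion when some units do not respond.

    p_respondents is the proportion answering "yes" among respondents;
    response_rate is the fraction of the sample that responded.
    The lower bound treats every nonrespondent as "no", the upper as "yes".
    """
    low = response_rate * p_respondents
    high = low + (1.0 - response_rate)
    return low, high

low, high = nonresponse_bounds(p_respondents=0.50, response_rate=0.30)
print(f"population proportion could lie anywhere from {low:.2f} to {high:.2f}")
```

With a 30% response rate and 50% of respondents answering "yes," the population proportion could be as low as 0.15 or as high as 0.85 if nothing is known about the nonrespondents. Against a range that wide, a reported margin of error of 3 percentage points describes only a small part of the uncertainty.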

The goal of this chapter was to sensitize you to various forms of selection bias and 
inaccurate responses. We can reduce some forms of selection bias by using probability 
sampling methods, as described in the next chapter. Accurate responses can often 
be achieved through careful design and testing of the survey instrument, thorough 
training of interviewers, and pretesting the survey. We shall return to nonsampling 
errors in Chapter 8, where we discuss methods that have been proposed for trying to 
reduce nonresponse error after the survey has been collected (sneak preview: none of
the methods is as good as obtaining a high response rate to begin with), and Chapter 15,
where we present a unified approach to survey design that attempts to minimize both 
sampling and nonsampling error. 


Why sample at all? With the abundance of poorly done surveys, it is not surprising that
some people are skeptical of all surveys. “After all,” some say, “my opinion has never
been asked, so how can the survey results claim to represent me?” Public questioning 
of the validity of surveys intensifies after a survey makes a large mistake in predicting 
the results of an election, such as in the Literary Digest survey of 1936 or in the 1948 
U.S. presidential election in which most pollsters predicted that Dewey would defeat 
Truman. A public backlash against survey research occurred again after the British 
general election of 1992, when the Conservative government won reelection despite 
the predictions from all but one of the major polling organizations that it would be a 
dead heat or that Labour would win. One member of Parliament expressed his opinion 
that “extrapolating what tens of millions are thinking from a tiny sample of opinions 
affronts human intelligence and negates true freedom of thought.” 

Some people insist that only a complete census, in which every element of the 
population is measured, will be satisfactory. This objection to sampling has a long 
history. When Anders Kiaer, director of Norwegian statistics, proposed using sampling 
for collecting official governmental statistics (Kiaer, 1897), his proposal was by no 
means universally well received. Opponents of sampling argued that sampling was 
dangerous, and that samples would never be able to replace a census. Within a few 
years, however, the international statistical community was largely persuaded that 
representative samples are a good thing, although probability samples were not widely 
used until the 1930s and 1940s. 

For small populations, a census may of course be practical. If you want to 
know about the employment history of 2005 Arizona State University graduates who 
majored in mathematics, it would be sensible to try to contact all of them. If all of 
the graduates respond, then estimates from the survey will have no sampling error. 
The estimates will have nonsampling errors, however, if the questions are poor or if 
respondents give inaccurate information. If some of the graduates do not return the 
questionnaire, then the estimates will likely be biased because of nonresponse. 

In general, taking a complete census of a population uses a great deal of time 
and money, and does not eliminate error. The biggest causes of error in a survey are 
often undercoverage, nonresponse, and sloppiness in data collection. Most of you 
have kept a paper or electronic checkbook register at some time, and essentially keep 
a census of all of the check and deposit amounts. How many of you can say that 
you have never made an error in your checkbook? It is usually much better to take 
a high-quality sample and allocate resources elsewhere, for instance, by being more 
careful in collecting or recording data, or doing follow-up studies, or measuring more 
variables. 

After all, the Literary Digest poll discussed in Example 1.1 predicted the vote 
wrong even in some counties in which it attempted to take a census. The U.S. Decen- 
nial Census, which attempts to enumerate every resident of the country, misses seg- 
ments of the population. Citro et al. (2004) document the coverage error in the 2000 
Census, reporting that black males had greater undercoverage than other demographic 
groups. 

There are three main justifications for using sampling: 


■ Sampling can provide reliable information at far less cost than a census. With 
probability samples (described in the next chapter), you can quantify the sampling 
error from a survey. In some instances, an observation unit must be destroyed to 
be measured, as when a cookie must be pulverized to determine the fat content. In 
such a case, a sample provides reliable information about the population; a census 
destroys the population and, with it, the need for information about it. 


■ Data can be collected more quickly, so estimates can be published in a timely 
fashion. An estimate of the unemployment rate for 2005 is not very helpful if it 
takes until 2015 to interview every household. 


■ Finally, and less well known, estimates based on sample surveys are often more 
accurate than those based on a census because investigators can be more careful 
when collecting data. A complete census often requires a large administrative 
organization, and involves many persons in the data collection. With the admin- 
istrative complexity and the pressure to produce timely estimates, many types of 
errors can be injected into the census. In a sample, more attention can be devoted 
to data quality through training personnel and following up on nonrespondents. It 
is far better to have good measurements on a representative sample than unreliable 
or biased measurements on the whole population. 


Deming says, “Sampling is not mere substitution of a partial coverage for a total 
coverage. Sampling is the science and art of controlling and measuring the reliability 
of useful statistical information through the theory of probability” (1950, p. 2). In the 
remaining chapters of this book, we explore this science and art in detail. 


Key Terms 


Census: A survey in which the entire population is measured. 


Coverage: The percentage of the population of interest that is included in the sam- 
pling frame. 


Measurement error: The difference between the response coded and the true value 
of the characteristic being studied for a respondent. 


Nonresponse: Failure of some units in the sample to provide responses to the survey. 


Nonsampling error: An error from any source other than sampling error. Examples 
include nonresponse and measurement error. 


Sampling error: Error in estimation due to taking a sample instead of measuring 
every unit in the population. 


Sampling frame: A list, map, or other specification of units in the population from 
which a sample may be selected. Examples include a list of all university students, a 
telephone directory, or a map of geographic segments. 


Selection bias: Bias that occurs because the actual probabilities with which units are 
sampled differ from the selection probabilities specified by the investigator. 


For Further Reading 


The American Statistical Association series “What is a Survey?” provides an intro- 
duction to survey sampling, with examples of many of the concepts discussed in 
Chapter 1. In particular, see the chapter “Judging the Quality of a Survey.” This series 
is available on the American Statistical Association Survey Research Methods Sec- 
tion website at www.amstat.org/sections/srms/. The American Association of Public 
Opinion Research website, www.aapor.org, contains many resources for the sampling 
practitioner, including a guide to Standards and Best Practices. 

The following three books are recommended for further reading about general 
issues for taking surveys. Groves et al. (2009) discuss statistical and nonstatistical 
issues in survey sampling, with examples from large-scale surveys. Biemer and Lyberg 
(2003) provide a thorough treatment of issues in survey quality. Dillman et al. (2009) 
give practical, research-supported guidance on everything from questionnaire design 
to choice of survey mode to timing of follow-up letters. 

If you are interested in more information on questionnaire design or on procedures 
for taking social surveys, start with the books by Presser et al. (2004), Fowler 
(1995), Converse and Presser (1986), Schuman and Presser (1981), and Sudman and 
Bradburn (1982). Much recent research has been done in the area of using results 
from cognitive psychology when writing questionnaires: Tanur (1992), Sudman et al. 
(1995), Schwarz and Sudman (1996), Tourangeau et al. (2000), and Bradburn (2004) 
are useful references on the topic. All are clearly written and list other references. In 
addition, many issues of the journal Public Opinion Quarterly have articles dealing 
with questionnaire design. 


A. Introductory Exercises 


For each survey in Exercises 1–20, describe the target population, sampling frame, 
sampling unit, and observation unit. Discuss any possible sources of selection bias or 
inaccuracy of responses. 


The article “What Readers Say about Marijuana” (Parade, July 31, 1994, p. 16) 
reported “More than 75% of the readers who took part in an informal PARADE 
telephone poll say marijuana should be as legal as alcoholic beverages.” The telephone 
poll was announced on page 5 of the June 12 issue; readers were instructed to “Call 
1-900-773-1200, at 75 cents a call, if you would like to answer the following questions. 
Use touch-tone phones only. To participate, call between 8 a.m. EDT [Eastern Daylight 
Time] on Saturday, June 11, and midnight EDT on Wednesday, June 15.” 


A student wants to estimate the percentage of mutual funds whose shares went up in 
price last week. She selects every tenth fund listing in the Mutual Fund pages of the 
newspaper, and calculates the percentage of those in which the share price increased. 


Amazon books (www.amazon.com) summarizes reader reviews of the books it sells. 
Persons who want to review a book can submit a review online; Amazon then reports 
the average rating from all reader reviews on its website. 


Potential jurors in some jurisdictions are chosen from a list of county residents who are 
registered voters or licensed drivers over age 18. In the fourth quarter of 1994, 100,300 
jury summons were mailed to Maricopa County, Arizona, residents. Approximately 
23,000 of those were returned from the post office as undeliverable. Approximately 
7000 persons were unqualified for service because they were not citizens, were under 
18, were convicted felons, or other reason that disqualified them from serving on a 
jury. An additional 22,000 were excused from jury service because of illness, financial 
hardship, military service, or other acceptable reason. The final sample consists of 
persons who appear for jury duty; some unexcused jurors fail to appear. 


Many scholars and policy makers are interested in the proportion of homeless people 
who are mentally ill. Wright (1988) estimates that 33% of all homeless people are 
mentally ill, by sampling homeless persons who received medical attention from one 
of the clinics in the Health Care for the Homeless (HCH) project. He argues that 
selection bias is not a serious problem because the clinics were easily accessible to 
the homeless and because the demographic profiles of HCH clients were close to 
those of the general homeless population in each city in the sample. Do you agree? 


Approximately 16,500 women returned the Healthy Women Survey that appeared 
in the September 1992 issue of Prevention. The May 1993 issue, reporting on the 
survey, stated that “Ninety-two percent of our readers rated their health as excellent, 
very good or good.” 


A survey is conducted to find the average weight of cows in a region. A list of all farms 
is available for the region, and 50 farms are selected at random. Then the weight of 
each cow at the 50 selected farms is recorded. 


To study nutrient content of menus in boarding homes for the elderly in Washington 
State, Goren et al. (1993) mailed surveys to all 184 licensed homes in Washington 
State, directed to the administrator and food service manager. Of those, 43 were 
returned by the deadline and included menus. 


Entries in the online encyclopedia Wikipedia can be written or edited by anyone with 
Internet access. This has given rise to concern about the accuracy of the information. 
Giles (2005) reports on a Nature study assessing the accuracy of Wikipedia science 
articles. Fifty subjects were chosen “on a broad range of scientific disciplines.” For 
each subject, the entries from Wikipedia and Encyclopaedia Britannica were sent to 
a relevant expert; 42 sets of usable reviews were returned. The editors of Nature then 
tallied the number of errors reported for each encyclopedia. 


The December 2003 issue of PC World reported the results from a survey of over 
32,000 subscribers asking about reliability and service for personal computers and 
other electronic equipment. The magazine “invited subscribers to take the Web-based 
survey from April 1 through June 30, 2003” and received 32,051 responses. Survey 
respondents were entered in a drawing to win prizes. They reported that 46% of 
desktop PCs had at least one significant malfunction. 


Karras (2008) reports on a survey conducted by SELF magazine on prevalence of 
eating disorders in women. The survey, posted online at self.com, obtained responses 
from 4000 women. Based on these responses, the article reports that 27% of women 
in the survey “say they would be ‘extremely upset’ if they gained just 5 pounds”; it is 
estimated that 10% of women have eating disorders such as anorexia or bulimia. 


Shen and Hsieh (1999) took a purposive sample of 29 higher education institutions; the 
institutions were “representative in terms of institutional type, geographic and demo- 
graphic diversity, religious/nonreligious affiliation, and the public/private dimension” 
(p. 318). They then mailed the survey to 2042 faculty members in the institutions, of 
whom 1219 returned the survey. 


The American Statistical Association sent the following e-mail with subject line “Joint 
Statistical Meetings 2005 Participants Survey” to a sample of persons who attended 
the 2005 Joint Statistical Meetings: “Thank you for attending the 2005 Joint Statistical 
Meetings (JSM) in Minneapolis, Minnesota. We need your help to complete an online 
survey about the JSM. Because the quality of the JSM is very important, a survey is 
being conducted to find out how we might improve future meetings. We would like 
to get your opinion about various aspects of the 2005 meeting and your preferences for 
2006 and beyond. 

You are part of a small sample of conference registrants who have been selected 
randomly to participate in the survey. We hope you will take the time to complete this 
short questionnaire online at www.amstat.org/meetings/jsm/2005/survey. In order to 
tabulate and analyze the data, please submit your response by mid-September 2005.” 


Farkas and Johnson (1997) report on a survey of professors of education taken in summer 
of 1997 and conclude that there is a large disparity between the views of education 
professors and those of the general public. A sample of 5324 education professors 
was drawn from a population of about 34,000 education professors in colleges and 
universities across the country. A letter was mailed to each professor in the sample in 
May 1997, inviting him or her to participate and to provide a number where he or she 
could be reached during the summer for a telephone interview. During the summer, 
a total of 778 interviews were completed by telephone. An additional 122 interviews 
were obtained by calling professors in the sample at work in August and September. 
To attempt to minimize question order effects, the survey was pretested and some 
questions were asked in random order. 

Respondents were asked which in a series of qualities were “absolutely essential” 
to be imparted to prospective teachers: 84% of the respondents selected having teach- 
ers who are “life-long learners and constantly updating their skills”; 41%, having 
teachers “trained in pragmatic issues of running a classroom such as managing time 
and preparing lesson plans”; 19%, for teachers to “stress correct spelling, grammar, 
and punctuation”; and 12%, for teachers to “expect students to be neat, on time, and 
polite” (p. 30). 


Kripke et al. (2002) claim that persons who sleep 8 or more hours per night have a 
higher mortality risk than persons who sleep 6 or 7 hours. They analyzed data from the 
1982 Cancer Prevention Study II of the American Cancer Society, a national survey 
taken by about 1.1 million people. The survival or date of death was determined for 
about 98% of the sample six years later. Most of the respondents were friends and 
relatives of American Cancer Society volunteers; the purpose of the original survey 
was to explore factors associated with the development of cancer, but the survey also 
contained a few questions about sleep and insomnia. 


In lawsuits about trademarks, a plaintiff claiming that another company is infringing 
on its trademarks must often show that the marks have a “secondary meaning” in 
the marketplace—that is, potential users of the product associate the trademarks with 
the plaintiff even when the company’s name is missing. In the court case Harlequin 
Enterprises Ltd v. Gulf & Western Corporation (503 F. Supp. 647, 1980), the publisher 
of Harlequin Romances persuaded the court that the cover design for “Harlequin 
Presents” novels had acquired secondary meaning. Part of the evidence presented 
was a survey of 500 women from three cities who identified themselves as readers of 
romance fiction. They were shown copies of unpublished “Harlequin Presents” novels 
with the Harlequin name hidden; over 50% identified the novel as a Harlequin product. 


Theoharakis and Skordia (2003) asked statisticians who responded to their survey to 
rank statistics journals in terms of prestige, importance, and usefulness. They gath- 
ered e-mail addresses for 12,053 statisticians from the online directories of statistical 
organizations and sent an e-mail invitation to each to participate in the online survey 
at www.alba.edu.gr/statsurvey/. A total of 2190 responses were obtained. The authors 
suggest that the results of their survey could help universities making promotion and 
tenure decisions about statistics faculty by providing information about the perceived 
quality of statistics journals. 


Ann Landers (1976) asked readers of her column to respond to the question: “If you 
had it to do over again, would you have children?” About 70% of the readers who 
responded said “No.” She received over 10,000 responses, 80% of those from women. 


The August 1996 issue of Consumer Reports contained satisfaction ratings for various 
health maintenance organizations used by readers of the magazine. Describing 
the survey, the editors say on p. 40, “The Ratings are based on more than 20,000 
responses to our 1995 Annual Questionnaire about experiences in HMOs between 
May 1994 and April 1995. Those results reflect experiences of Consumer Reports 
subscribers, who are a more affluent and educated cross-section of the U.S. popu- 
lation.” Answer the general questions about target population, sampling frame, and 
units for this survey. Also, do you think that this survey provides valuable informa- 
tion for comparing health plans? If you were selecting an HMO for yourself, which 
information would you rather have: results from this survey, or results from customer 
satisfaction surveys conducted by the individual HMOs? 


Ebersole (2000) studied how students in selected public schools describe their use 
of the Internet. Five school districts in a Western state were selected to give a cross- 
section of urban and rural schools that have Internet access. A survey, administered 
electronically, was installed as the home page in middle and high school media centers 
in the districts “for a period of time to gather approximately 100 responses from each 
school.” Students who had parental permission to access the Internet were permitted 
to access the computer-administered survey. Participation in the survey was voluntary. 


The following questions, quoted in Kinsley (1981), were from a survey conducted by 
Cambridge Reports, Inc., and financed by Union Carbide Corporation. Critique these 
questions. 


Some people say that granting companies tax credits for the taxes they actually pay to 
foreign nations could increase these companies’ international competitiveness. If you 
knew for a fact that the tax credits for taxes paid to foreign countries would increase 
the money available to US companies to expand and modernize their plants and create 
more jobs, would you favor or oppose such a tax policy? 


Do you favor or oppose changing environmental regulations so that while they still 
protect the public, they cost American businesses less and lower product costs? 


Frankovic (2008) reported that in 1970, a poll conducted by the Harris organization 
for Virginia Slims, a brand of cigarettes marketed primarily to women, had the fol- 
lowing question: “There won’t be a woman president of the U.S. for a long time and 
that’s probably just as well.” Sixty-seven percent of female respondents agreed with 
the statement. Critique this question. 


On March 21, 1993, NBC televised “The First National Referendum—Government 
Reform Presented by Ross Perot.” During the show, 1992 U.S. presidential candidate 
Perot asked viewers to express their opinions by mailing in The National Referendum 
on Government Reform, printed in the March 20 issue of TV Guide. Some of the 
questions on the survey were the following: 


Do you believe that for every dollar of tax increase there should be $2.00 in 
spending cuts with the savings earmarked for deficit and debt reduction? 


Should the President present an overall plan including spending cuts, spending 
increases, and tax increases and present the net result of the overall plan, so that 
the people can know the net result before paying more taxes? 

Should the electoral college be replaced with a popular vote for the Presidential 
election? 


Was this TV forum worthwhile? Do you wish to continue participating as a voting 
member of United We Stand America? 


Critique these questions. 


D. Projects and Activities 


Read the article by Roush (1996), which describes a proposal for using sampling in 
the 2000 U.S. Census. What are the main arguments for using sampling in 2000? 
Against? What do you think? You may also want to read Holden (2009), about issues 
in the 2010 Census. 


(For students of U.S. history.) Eighty-five letters appeared in New York City news- 
papers in 1787 and 1788, with the purpose of drawing support in the state for the 
newly drafted Constitution. Collectively, these letters are known as The Federalist. 
Read Number 54 of The Federalist, in which the author (widely thought to be James 
Madison) discusses using a population census to apportion elected representatives 
and taxes among the states. This article explains part of Article I, Section II of the 
United States Constitution. 

Write a short paper discussing Madison’s view of a population census. What is the 
target population and sampling frame? What sources of bias does Madison mention, 
and how does he propose to reduce bias? What is your reaction to Madison’s plan, 
from a statistical point of view? Where do you think Madison would stand today on the 
issue of using sampling versus complete enumeration to obtain population estimates? 


Read the article by Horvitz et al. (1995) on Self-Selected Opinion Polls, which they 
term SLOPs. Find an example of a SLOP and explain why results from the poll should 
not be generalized to the population of interest. 


Find a recent survey reported in a newspaper, academic journal, or popular magazine. 
Describe the survey. What are the target population and sampled population? What 
conclusions are drawn about the survey in the article? Do you think those conclusions 
are justified? What are possible sources of bias for the survey? 


Find a survey on the Internet. For example, Survey.Net (www.survey.net) allows you 
to participate in surveys on a variety of subjects; you can find other surveys by search- 
ing online for “survey” or “take survey.” Participate in one of the surveys yourself, 
and write a paragraph or two describing the survey and its results (most online surveys 
allow you to see the statistics from all the persons who have taken the survey). What 
are the target population and sampled population? What biases do you think might 
occur in the results? 


Some polling organizations recruit volunteers for an Internet panel and then take sam- 
ples from the panel to measure public opinion. Volunteer to be in an Internet panel 
for a survey organization (search for “online poll” to find one). What information are 
you asked to provide? Report how the organization produces estimates. 


At about the same time Hite (1987) conducted the survey described in Section 1.1, 
a survey on similar topics was taken in the United Kingdom. Read the article by 
Wadsworth et al. (1993) describing the National Survey of Sexual Attitudes and 
Lifestyles. How do the authors describe potential biases and sources of error in the 
survey? Contrast the possible errors in this survey with those in Hite (1987). 


Simple Probability Samples 


[Kennedy] read every fiftieth letter of the thirty thousand coming weekly to the White House, as well as 
a statistical summary of the entire batch, but he knew that these were often as organized and 
unrepresentative as the pickets on Pennsylvania Avenue. 


—Theodore Sorensen, Kennedy 


The examples of bad surveys in Chapter 1—for example, the Literary Digest survey 
in Example 1.1—had major flaws that resulted in unrepresentative samples. In this 
chapter, we discuss how to use probability sampling to conduct surveys. In a proba- 
bility sample, each unit in the population has a known probability of selection, and a 
random number table or other randomization mechanism is used to choose the specific 
units to be included in the sample. If a probability sampling design is implemented 
well, an investigator can use a relatively small sample to make inferences about an 
arbitrarily large population. 

In Chapters 2 through 6, we explore survey design and properties of estimators 
for the three major design components used in a probability sample: simple random 
sampling, stratified sampling, and cluster sampling. We shall integrate all these ideas 
in Chapter 7, and show how they are combined in complex surveys such as the U.S. 
National Crime Victimization Survey. To simplify presentation of the concepts, we 
assume for now that the sampled population is the target population, that the sampling 
frame is complete, that there is no nonresponse or missing data, and that there is no 
measurement error. We return to nonsampling errors in Chapter 8. 

As you might suppose, you need to know some probability to be able to understand 
probability sampling. You may want to review the material in Sections A.1 and A.2 
of Appendix A while reading this chapter. 


2.1 Types of Probability Samples 


The terms simple random sample, stratified sample, cluster sample, and systematic 
sample are basic to any discussion of sample surveys, so let’s define them now. 


A simple random sample (SRS) is the simplest form of probability sample. An 
SRS of size n is taken when every possible subset of n units in the population 
has the same chance of being the sample. SRSs are the focus of this chapter 
and the foundation for more complex sampling designs. In taking a random sam- 
ple, the investigator is in effect mixing up the population before grabbing n units. 
The investigator does not need to examine every member of the population for 
the same reason that a medical technician does not need to drain you of blood 
to measure your red blood cell count: Your blood is sufficiently well mixed that 
any sample should be representative. SRSs are discussed in Section 2.3, after we 
present the basic framework for probability samples in Section 2.2. 


In a stratified random sample, the population is divided into subgroups called 
strata. Then an SRS is selected from each stratum, and the SRSs in the strata 
are selected independently. The strata are often subgroups of interest to the 
investigator—for example, the strata might be different regions of the country 
in a survey of people, different types of terrain in an ecological survey, or sizes of 
firms in a business survey. Elements in the same stratum often tend to be more sim- 
ilar than randomly selected elements from the whole population, so stratification 
often increases precision, as we shall see in Chapter 3. 


In a cluster sample, observation units in the population are aggregated into larger 
sampling units, called clusters. Suppose you want to survey Lutheran church 
members in Minneapolis but do not have a list of all church members in the city, 
so you cannot take an SRS of Lutheran church members. However, you do have 
a list of all the Lutheran churches. You can then take an SRS of the churches 
and then subsample all or some church members in the selected churches. In this 
case, the churches form the clusters, and the church members are the observation 
units. It is more convenient to sample at the church level; however, members of 
the same church may have more similarities than Lutherans selected at random 
in Minneapolis, so a cluster sample of 500 Lutherans may not provide as much 
information as an SRS of 500 Lutherans. We shall explore this idea further in 
Chapter 5. 


In a systematic sample, a starting point is chosen from a list of population mem- 
bers using a random number. That unit, and every kth unit thereafter, is chosen to 
be in the sample. A systematic sample thus consists of units that are equally spaced 
in the list. Systematic samples will be discussed in more detail in Sections 2.7 
and 5.5. 


EXAMPLE 2.1 

Suppose you want to estimate the average amount of time that professors at your 
university say they spent grading homework in a specific week. To take an SRS, 
construct a list of all professors and randomly select n of them to be your sample. 
Now ask each professor in your sample how much time he or she spent grading 
homework that week (you would of course have to define the words homework and 
grading carefully in your questionnaire). In a stratified sample, you might classify 
faculty by college: engineering, liberal arts and sciences, business, nursing, and fine 
arts. You would then take an SRS of faculty in the engineering college, a separate 
SRS of faculty in liberal arts and sciences, and so on. For a cluster sample, you 
might randomly select 10 of the 60 academic departments in the university and ask 
each faculty member in those departments how much time he or she spent grading 
homework. A systematic sample could be chosen by selecting an integer at random 
between 1 and 20; if the random integer is 16, say, then you would include professors 
in positions 16, 36, 56, and so on, in the list. 


FIGURE 2.1 
Examples of a simple random sample, stratified random sample, cluster sample, and systematic 
sample of 20 integers from the population {1, 2, ..., 100}. 
[Figure not reproduced. Each panel marks the sampled integers on a horizontal axis running 
from 0 to 100. Surviving panel titles: “Stratified random sample of 20 numbers from population 
of 100 numbers” and “Systematic sample of 20 numbers from population of 100 numbers.”] 


Figure 2.1 illustrates the differences among simple random, stratified, cluster, and 
systematic sampling for selecting a sample of 20 integers from the population 
{1, 2, ..., 100}. For the stratified sample, the population was divided into the 10 strata 
{1, 2, ..., 10}, {11, 12, ..., 20}, ..., {91, 92, ..., 100}, and an SRS of 2 numbers was 
drawn from each of the 10 strata. This ensures that each stratum is represented in the 
sample. For the cluster sample, the population was divided into 20 clusters 
{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, ..., {96, 97, 98, 99, 100}; an SRS of 4 of these clusters 
was selected. For the systematic sample, the random starting point was 3, so the sample 
contains units 3, 8, 13, 18, and so on. ■ 
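
The four designs illustrated in Figure 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the arbitrary seed, and the use of Python's standard `random` module are choices made for this sketch and do not come from the text, which uses SAS for its calculations.

```python
import random

def srs(population, n, rng):
    """Simple random sample: every subset of n units is equally likely."""
    return sorted(rng.sample(population, n))

def stratified(population, stratum_size, n_per_stratum, rng):
    """Cut the list into consecutive strata; draw an independent SRS in each."""
    sample = []
    for start in range(0, len(population), stratum_size):
        stratum = population[start:start + stratum_size]
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sorted(sample)

def cluster(population, cluster_size, n_clusters, rng):
    """Cut the list into consecutive clusters; take an SRS of whole clusters."""
    clusters = [population[i:i + cluster_size]
                for i in range(0, len(population), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return sorted(unit for c in chosen for unit in c)

def systematic(population, k, rng):
    """Random start among the first k units, then every kth unit after it."""
    start = rng.randrange(k)  # 0-based list position of the first sampled unit
    return population[start::k]

rng = random.Random(2)            # arbitrary seed, only for reproducibility
population = list(range(1, 101))  # the population {1, 2, ..., 100}

samples = {
    "srs": srs(population, 20, rng),
    "stratified": stratified(population, 10, 2, rng),  # 10 strata, 2 from each
    "cluster": cluster(population, 5, 4, rng),         # 4 of the 20 clusters
    "systematic": systematic(population, 5, rng),      # every 5th unit
}
for name, s in samples.items():
    print(name, s)
```

Every design returns 20 integers, but the stratified sample always has exactly two units in each block of ten, the cluster sample occupies only four blocks of five, and the systematic sample is evenly spaced.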


All of these methods—simple random sampling, stratified random sampling, clus- 
ter sampling, and systematic sampling—involve random selection of units to be in the 
sample. In an SRS, the observation units themselves are selected at random from the 
population of observation units; in a stratified random sample, observation units within 
each stratum are randomly selected; in a cluster sample, the clusters are randomly 
selected from the population of all clusters. Each method is a form of probability 
sampling, which we discuss in the next section. 


2.2 Framework for Probability Sampling 


To show how probability sampling works, we need to be able to list the N units in 
the finite population. The finite population, or universe, of N units is denoted by the 
index set 


U = {1,2,...,N}. (2.1) 


Out of this population we can choose various samples, which are subsets of $U$. The 
particular sample chosen is denoted by $S$, a subset consisting of $n$ of the units in $U$. 

Suppose the population has four units: $U = \{1, 2, 3, 4\}$. Six different samples of 
size 2 could be chosen from this population: 


$$
\begin{array}{ll}
S_1 = \{1, 2\} & S_4 = \{2, 3\} \\
S_2 = \{1, 3\} & S_5 = \{2, 4\} \\
S_3 = \{1, 4\} & S_6 = \{3, 4\}
\end{array}
$$
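
For a population this small, the list of possible samples can be generated rather than written out by hand. The sketch below assumes Python's standard `itertools.combinations`; the variable names are hypothetical. It produces the subsets in the same order as the six samples listed above.

```python
from itertools import combinations
from math import comb

N, n = 4, 2                    # population size and sample size from the text
universe = range(1, N + 1)     # the index set {1, 2, 3, 4}

# combinations() yields the subsets in lexicographic order, which matches
# the labelling of the six samples in the text.
all_samples = [set(s) for s in combinations(universe, n)]

for j, s in enumerate(all_samples, start=1):
    print(f"S{j} = {sorted(s)}")

# The number of possible samples is the binomial coefficient "N choose n".
print(len(all_samples), comb(N, n))
```

The count grows quickly with N and n, which is why such enumeration is a conceptual device rather than a practical procedure for real surveys.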


In probability sampling, each possible sample $S$ from the population has a known 
probability $P(S)$ of being chosen, and the probabilities of the possible samples 
sum to 1. One possible sample design for a probability sample of size 2 would 
have $P(S_1) = 1/3$, $P(S_2) = 1/6$, and $P(S_6) = 1/2$, and $P(S_3) = P(S_4) = P(S_5) = 0$. 
The probabilities $P(S_1)$, $P(S_2)$, and $P(S_6)$ of the possible samples are known before 
the sample is drawn. One way to select the sample would be to place six labeled balls 
in a box; two of the balls are labeled 1, one is labeled 2, and three are labeled 6. Now 
choose one ball at random; if a ball labeled 6 is chosen, then $S_6$ is the sample. 

In a probability sample, since each possible sample has a known probability 
of being the chosen sample, each unit in the population has a known probability of 
appearing in our selected sample. We calculate 


$$\pi_i = P(\text{unit } i \text{ in sample}) \qquad (2.2)$$


by summing the probabilities of all possible samples that contain unit i. In probability
sampling, the $\pi_i$ are known before the survey commences, and we assume that $\pi_i > 0$
for every unit in the population. For the sample design described above,
$\pi_1 = P(S_1) + P(S_2) + P(S_3) = 1/2$, $\pi_2 = P(S_1) + P(S_4) + P(S_5) = 1/3$,
$\pi_3 = P(S_2) + P(S_4) + P(S_6) = 2/3$, and $\pi_4 = P(S_3) + P(S_5) + P(S_6) = 1/2$.
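The inclusion probabilities for this small design can be checked mechanically. The following Python sketch is an illustration added here, not part of the original text; the names `design` and `inclusion_probabilities` are hypothetical. It sums P(S) over the samples that contain each unit.

```python
from fractions import Fraction

# Sample design from the text: P(S1)=1/3, P(S2)=1/6, P(S6)=1/2, all others 0.
design = {
    frozenset({1, 2}): Fraction(1, 3),  # S1
    frozenset({1, 3}): Fraction(1, 6),  # S2
    frozenset({1, 4}): Fraction(0),     # S3
    frozenset({2, 3}): Fraction(0),     # S4
    frozenset({2, 4}): Fraction(0),     # S5
    frozenset({3, 4}): Fraction(1, 2),  # S6
}

def inclusion_probabilities(design, units):
    """Return pi_i = sum of P(S) over all samples S that contain unit i."""
    return {i: sum(p for s, p in design.items() if i in s) for i in units}

assert sum(design.values()) == 1  # the sample probabilities must sum to 1
pi = inclusion_probabilities(design, units=[1, 2, 3, 4])
for unit, prob in pi.items():
    print(unit, prob)
```

Exact fractions are used so that the results can be compared directly with the values 1/2, 1/3, 2/3, and 1/2 given above.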

Of course, we never write all possible samples down and calculate the probability 
with which we would choose every possible sample—this would take far too long. 
But such enumeration underlies all of probability sampling. Investigators using a 
probability sample have much less discretion about which units are included in the 
sample, so using probability samples helps us avoid some of the selection biases 
described in Chapter 1. In a probability sample, the interviewer cannot choose to 
substitute a friendly looking person for the grumpy person selected to be in the sample
by the random selection method. A forester taking a probability sample of trees cannot
simply measure the trees near the road but must measure the trees designated for 
inclusion in the sample. Taking a probability sample is much harder than taking a 
convenience sample, but a probability sampling procedure guarantees that each unit 
in the population could appear in the sample and provides information that can be 
used to assess the precision of statistics calculated from the sample. 

Within the framework of probability sampling, we can quantify how likely it is 
that our sample is a “good” one. A single probability sample is not guaranteed to 
be representative of the population with regard to the characteristics of interest, but 
we can quantify how often samples will meet some criterion of representativeness. 
The notion is the same as that of confidence intervals: We do not know whether the 
particular 95% confidence interval we construct for the mean contains the true value 
of the mean. We do know, however, that if the assumptions for the confidence interval 
procedure are valid and if we repeat the procedure over and over again, we can expect 
95% of the resulting confidence intervals to contain the true value of the mean. 

Let $y_i$ be a characteristic associated with the ith unit in the population. We consider
$y_i$ to be a fixed quantity; if Farm 723 is included in the sample, then the amount of
corn produced on Farm 723, $y_{723}$, is known exactly.


EXAMPLE 2.2 To illustrate these concepts, let’s look at an artificial situation in which we know the
value of $y_i$ for each of the N = 8 units in the whole population. The index set for the
population is


U = {1,2,3,4,5, 6, 7, 8}. 


The values of $y_i$ are


i        1   2   3   4   5   6   7   8
$y_i$    1   2   4   4   7   7   7   8


There are 70 possible samples of size 4 that may be drawn without replacement
from this population; the samples are listed in file samples.dat on the website. If the
sample consisting of units {1,2,3,4} were chosen, the corresponding values of $y_i$
would be 1, 2, 4, and 4. The values of $y_i$ for the sample {2,3,6,7} are 2, 4, 7, and
7. Define P(S) = 1/70 for each distinct subset of size four from U. As you will see
after you read Section 2.3, this design is an SRS without replacement. Each unit is in
exactly 35 of the possible samples, so $\pi_i = 1/2$ for i = 1, 2, ..., 8.

A random mechanism is used to select one of the 70 possible samples. One possible 
mechanism for this example, because we have listed all possible samples, is to generate 
a random number between 1 and 70 and select the corresponding sample. With large
populations, however, the number of samples is so great that it is impractical to list 
all possible samples—instead, another method is used to select the sample. Methods 
that will give an SRS will be described in Section 2.3.
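For a population this small, the enumeration can be carried out directly. The sketch below is an added illustration with hypothetical variable names and an arbitrarily chosen seed. It lists every subset of size 4, counts how many contain each unit, and then uses a random number between 1 and 70 to pick one sample.

```python
import random
from itertools import combinations

N, n = 8, 4
units = range(1, N + 1)

# All subsets of size n from U = {1, ..., 8}; each has P(S) = 1/70 in this design.
samples = list(combinations(units, n))

# Count how many of the possible samples contain each unit.
counts = {i: sum(i in s for s in samples) for i in units}
pi = {i: counts[i] / len(samples) for i in units}

# Random mechanism: pick a number between 1 and 70 and take that sample.
rng = random.Random(2024)  # seed chosen arbitrarily for reproducibility
k = rng.randint(1, len(samples))
chosen = samples[k - 1]
print(len(samples), counts[1], pi[1], chosen)
```

The ordering of `samples` here is the one produced by `itertools.combinations`; it need not match the ordering in samples.dat.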


Most results in sampling rely on the sampling distribution of a statistic, the 
distribution of different values of the statistic obtained by the process of taking all 
possible samples from the population. A sampling distribution is an example of a 
discrete probability distribution.

Suppose we want to use a sample to estimate a population quantity, say the
population total $t = \sum_{i=1}^{N} y_i$. One estimator we might use for t is
$\hat{t}_S = N\bar{y}_S$, where $\bar{y}_S$ is the average of the $y_i$’s in S, the chosen
sample. In our example, t = 40. If the sample S consists of units 1, 3, 5, and 6, then
$\hat{t}_S = 8 \times (1 + 4 + 7 + 7)/4 = 38$. Since we know the whole population here,
we can find $\hat{t}_S$ for each of the 70 possible samples. The probabilities of selection
for the samples give the sampling distribution of $\hat{t}$:


$$P\{\hat{t} = k\} = \sum_{S:\, \hat{t}_S = k} P(S).$$


The summation is over all samples S for which $\hat{t}_S = k$. We know the probability P(S)
with which we select a sample S because we take a probability sample.


EXAMPLE 2.3 The sampling distribution of $\hat{t}$ for the population and sampling design in Example 2.2
derives entirely from the probabilities of selection for the various samples. Four
samples ({3,4,5,6}, {3,4,5,7}, {3,4,6,7}, and {1,5,6,7}) result in the estimate $\hat{t} = 44$,
so $P\{\hat{t} = 44\} = 4/70$. For this example, we can write out the sampling distribution of
$\hat{t}$ because we know the values for the entire population.


k                    22    28    30    32    34    36    38    40     42    44    46    48    50    52    58
$P\{\hat{t} = k\}$   1/70  6/70  2/70  3/70  7/70  4/70  6/70  12/70  6/70  4/70  7/70  3/70  2/70  6/70  1/70


Figure 2.2 displays the sampling distribution.
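The sampling distribution can be reproduced by brute force. The sketch below is an added illustration; the $y_i$ values are those implied by the samples quoted in the text, and the names `t_hat` and `dist` are hypothetical. It accumulates P(S) = 1/70 over all samples that give the same estimate.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

y = {1: 1, 2: 2, 3: 4, 4: 4, 5: 7, 6: 7, 7: 7, 8: 8}  # y_i for units 1..8
N, n = len(y), 4

samples = list(combinations(sorted(y), n))
p_sample = Fraction(1, len(samples))  # P(S) = 1/70 for every sample

def t_hat(sample):
    """Estimated total N * (sample mean), kept exact with Fraction."""
    return Fraction(N * sum(y[i] for i in sample), n)

# P{t_hat = k} is the sum of P(S) over the samples with t_hat_S = k.
dist = Counter()
for s in samples:
    dist[t_hat(s)] += p_sample

for k in sorted(dist):
    print(int(k), dist[k] * len(samples), "/ 70")
```

Each printed line gives a value k and the numerator of its probability over 70, which can be compared with the table in Example 2.3.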


The expected value of $\hat{t}$, $E[\hat{t}]$, is the mean of the sampling distribution of $\hat{t}$:


$$E[\hat{t}] = \sum_{S} \hat{t}_S P(S) = \sum_{k} k\, P(\hat{t} = k). \qquad (2.3)$$


FIGURE 2.2
Sampling distribution of the sample total in Example 2.3.

[Figure: probability plotted against the estimate of t from the sample; the horizontal axis runs from 20 to 60.]



The expected value of the statistic is the weighted average of the possible sample 
values of the statistic, weighted by the probability that particular value of the statistic 
would occur. 

The estimation bias of the estimator $\hat{t}$ is


$$\text{Bias}[\hat{t}] = E[\hat{t}] - t. \qquad (2.4)$$


If $\text{Bias}[\hat{t}] = 0$, we say that the estimator $\hat{t}$ is unbiased for t. For the data in Example 2.2
the expected value of $\hat{t}$ is


$$E[\hat{t}] = \frac{1}{70}(22) + \frac{6}{70}(28) + \cdots + \frac{1}{70}(58) = 40.$$


Thus, the estimator is unbiased.

Note that the mathematical definition of bias in (2.4) is not the same thing as 
the selection or measurement bias described in Chapter 1. All indicate a systematic 
deviation from the population value, but from different sources. Selection bias is due 
to the method of selecting the sample—often, the investigator acts as though every 
possible sample S has the same probability of being selected, but some subsets of 
the population actually have a different probability of selection. With undercoverage, 
for example, the probability of including a unit not in the sampling frame is zero. 
Measurement bias means that the $y_i$’s are not really the quantities of interest, so
although $\hat{t}$ may be unbiased in the sense of (2.4) for $t = \sum_{i=1}^{N} y_i$, t itself would not
be the true total of interest. Estimation bias means that the estimator chosen results
in bias; for example, if we used $\hat{t}_S = \sum_{i \in S} y_i$ and did not take a census, $\hat{t}$ would be
biased. To illustrate these distinctions, suppose you wanted to estimate the average
height of male actors belonging to the Screen Actors Guild. Selection bias would 
occur if you took a convenience sample of actors on the set—perhaps taller actors are 
more or less likely to be working. Measurement bias would occur if your tape measure 
inaccurately added 3 centimeters (cm) to each actor’s height. Estimation bias would 
occur if you took an SRS from the list of all actors in the Guild, but estimated mean 
height by the average height of the six shortest men in the sample—the sampling 
procedure is good, but the estimator is bad. 

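The variance of the sampling distribution of $\hat{t}$ is


$$V(\hat{t}) = E[(\hat{t} - E[\hat{t}])^2] = \sum_{\text{all possible samples } S} P(S)\,[\hat{t}_S - E(\hat{t})]^2. \qquad (2.5)$$


For the data in Example 2.2,

$$V(\hat{t}) = \frac{1}{70}(22 - 40)^2 + \cdots + \frac{1}{70}(58 - 40)^2 = \frac{3840}{70} = 54.86.$$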


Because we sometimes use biased estimators, we often use the mean squared error 
(MSE) rather than variance to measure the accuracy of an estimator. 

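$$\begin{aligned}
\text{MSE}[\hat{t}] &= E[(\hat{t} - t)^2] \\
&= E[(\hat{t} - E[\hat{t}] + E[\hat{t}] - t)^2] \\
&= E[(\hat{t} - E[\hat{t}])^2] + (E[\hat{t}] - t)^2 + 2E[(\hat{t} - E[\hat{t}])(E[\hat{t}] - t)] \\
&= V(\hat{t}) + [\text{Bias}(\hat{t})]^2.
\end{aligned}$$


FIGURE 2.3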

Unbiased, precise, and accurate archers. Archer A is unbiased—the average position of all 
arrows is at the bull’s-eye. Archer B is precise but not unbiased—all arrows are close together 
but systematically away from the bull’s-eye. Archer C is accurate—all arrows are close together 
and near the center of the target. 


[Figure: three targets, labeled Archer A, Archer B, and Archer C.]


Thus, an estimator $\hat{t}$ of t is unbiased if $E(\hat{t}) = t$, precise if $V(\hat{t}) = E[(\hat{t} - E[\hat{t}])^2]$ is
small, and accurate if $\text{MSE}[\hat{t}] = E[(\hat{t} - t)^2]$ is small. A badly biased estimator may be
precise but it will not be accurate; accuracy (MSE) is how close the estimate is to the 
true value, while precision (variance) measures how close estimates from different 
samples are to each other. Figure 2.3 illustrates these concepts. 
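To make the three quantities concrete, the sketch below (an added illustration with hypothetical names such as `summarize`) enumerates all 70 samples from the small population of Example 2.2 and computes the expected value, bias, variance, and MSE of two estimators of t: the unbiased $N\bar{y}_S$ and the biased sample sum $\sum_{i \in S} y_i$ mentioned in the discussion of estimation bias.

```python
from fractions import Fraction
from itertools import combinations

y = [1, 2, 4, 4, 7, 7, 7, 8]    # population values for units 1..8
N, n = len(y), 4
t = sum(y)                       # true population total, 40
samples = list(combinations(range(N), n))
P = Fraction(1, len(samples))    # equal-probability design (SRS)

def summarize(estimator):
    """Return (expected value, bias, variance, MSE) over all possible samples."""
    values = [estimator([y[i] for i in s]) for s in samples]
    mean = sum(P * v for v in values)
    var = sum(P * (v - mean) ** 2 for v in values)
    mse = sum(P * (v - t) ** 2 for v in values)
    return mean, mean - t, var, mse

unbiased = summarize(lambda ys: Fraction(N * sum(ys), n))  # N * ybar
biased = summarize(lambda ys: Fraction(sum(ys)))           # sample sum only

for name, (e, b, v, m) in [("N*ybar", unbiased), ("sample sum", biased)]:
    assert m == v + b ** 2  # MSE = variance + bias^2
    print(name, e, b, float(v), float(m))
```

In this population the sample sum has the smaller variance, so it is the more precise of the two, but its bias of −20 gives it a much larger MSE, so it is the less accurate.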

In summary, the finite population U consists of units {1,2,...,N} whose measured
values are $\{y_1, y_2, \ldots, y_N\}$. We select a sample S of n units from U using the
probabilities of selection that define the sampling design. The $y_i$’s are fixed but unknown
quantities, unknown unless that unit happens to appear in our sample S. Unless we
make additional assumptions, the only information we have about the set of $y_i$’s in
the population is in the set $\{y_i : i \in S\}$.

You may be interested in many different population quantities from your pop- 
ulation. Historically, however, the main impetus for developing theory for sample 
surveys has been estimating population means and totals. Suppose we want to esti- 
mate the total number of persons in Canada who have diabetes, or the average number 
of oranges produced per orange tree. The population total is 


$$t = \sum_{i=1}^{N} y_i$$


and the mean of the population is 


$$\bar{y}_U = \frac{1}{N} \sum_{i=1}^{N} y_i.$$

Almost all populations exhibit some variability; for example, households have different
incomes and trees have different diameters. Define the variance of the population
values about the mean as


$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2. \qquad (2.6)$$

The population standard deviation is $S = \sqrt{S^2}$.

It is sometimes helpful to have a special notation for proportions. The proportion
of units having a characteristic is simply a special case of the mean, obtained by
letting $y_i = 1$ if unit i has the characteristic of interest, and $y_i = 0$ if unit i does not
have the characteristic. Let


$$p = \frac{\text{number of units with the characteristic in the population}}{N}.$$


For the population in Example 2.2, let


$$y_i = \begin{cases} 1 & \text{if unit } i \text{ has the value 7} \\ 0 & \text{if unit } i \text{ does not have the value 7.} \end{cases}$$


Let $\hat{p}_S = \sum_{i \in S} y_i/4$, the proportion of 7s in the sample. The list of all possible
samples in the data file samples.dat has 5 samples with no 7s, 30 samples with exactly
one 7, 30 samples with exactly two 7s, and 5 samples with three 7s. Since one of the
possible samples is selected with probability 1/70, the sampling distribution of $\hat{p}$ is:¹


k                    0      1/4     1/2     3/4
$P\{\hat{p} = k\}$   5/70   30/70   30/70   5/70
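The counts 5, 30, 30, and 5 can also be obtained without listing the samples. The sketch below is an added illustration that uses a direct counting argument (choose j of the three units with value 7 and the remaining 4 − j from the other five units); this argument is an assumption of the sketch and is not necessarily the derivation referred to in the footnote.

```python
from fractions import Fraction
from math import comb

N, n = 8, 4
sevens = 3  # three units in the population of Example 2.2 have the value 7

# Number of samples containing exactly j sevens: C(3, j) * C(5, n - j).
total = comb(N, n)
dist = {
    Fraction(j, n): Fraction(comb(sevens, j) * comb(N - sevens, n - j), total)
    for j in range(0, min(sevens, n) + 1)
}
for p_hat, prob in dist.items():
    print(p_hat, prob)
```

The probabilities print in lowest terms (for example, 30/70 appears as 3/7).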


Simple Random Sampling 


Simple random sampling is the most basic form of probability sampling, and provides 
the theoretical basis for the more complicated forms. There are two ways of taking 
a simple random sample: with replacement, in which the same unit may be included 
more than once in the sample, and without replacement, in which all units in the 
sample are distinct. 

A simple random sample with replacement (SRSWR) of size n from a pop- 
ulation of N units can be thought of as drawing n independent samples of size 1. 
One unit is randomly selected from the population to be the first sampled unit, with 
probability 1/N. Then the sampled unit is replaced in the population, and a second 
unit is randomly selected with probability 1/N. This procedure is repeated until the 
sample has n units, which may include duplicates from the population. 

In finite population sampling, however, sampling the same person twice provides 
no additional information. We usually prefer to sample without replacement, so that 
the sample contains no duplicates. A simple random sample without replacement 
(SRS) of size n is selected so that every possible subset of n distinct units in the
population has the same probability of being selected as the sample. There are
$\binom{N}{n}$ possible samples (see Appendix A), and each is equally likely, so the
probability of selecting any individual sample S of n units is


$$P(S) = \frac{1}{\binom{N}{n}} = \frac{n!\,(N-n)!}{N!}. \qquad (2.7)$$


¹ An alternative derivation of the sampling distribution is in Exercise A.2 in Appendix A.


As a consequence of this definition, the probability that the ith unit appears in the
sample is $\pi_i = n/N$, as shown in Section 2.8.

To take an SRS, you need a list of all observation units in the population; this list is 
the sampling frame. In an SRS, the sampling unit and observation unit coincide. Each 
unit is assigned a number, and a sample is selected so that each possible sample of 
size n has the same chance of being the sample actually selected. This can be thought 
of as drawing numbers out of a hat; in practice, computer-generated pseudo-random 
numbers are usually used to select a sample. 

One method for selecting an SRS of size n from a population of size N is to 
generate N random numbers between 0 and 1, then select the units corresponding to 
the n smallest random numbers to be the sample. For example, if N = 10 and n = 4,
we generate 10 numbers between 0 and 1:


unit i           1      2      3      4      5      6      7      8      9      10
random number    0.837  0.636  0.465  0.609  0.154  0.766  0.821  0.713  0.987  0.469


The smallest 4 of the random numbers are 0.154, 0.465, 0.469 and 0.609, leading
to the sample with units {3,4,5,10}. Other methods that might be used to select an
SRS are described in Example 2.5 and Exercises 21 and 29. Several survey software 
packages will select an SRS from a list of N units; the file srsselect.sas on the website 
gives code for selecting an SRS using SAS PROC SURVEYSELECT. 
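The method of taking the units with the n smallest random numbers is easy to sketch in code. The Python below is an added illustration, not the SAS code from the website; the function name `srs_smallest` is hypothetical and the seed is arbitrary.

```python
import random

def srs_smallest(random_numbers, n):
    """Return the n unit labels (1-based) with the smallest random numbers."""
    order = sorted(range(len(random_numbers)), key=lambda i: random_numbers[i])
    return sorted(i + 1 for i in order[:n])

# The ten random numbers listed in the text, for units 1 through 10.
u = [0.837, 0.636, 0.465, 0.609, 0.154, 0.766, 0.821, 0.713, 0.987, 0.469]
print(srs_smallest(u, 4))

# The same method with freshly generated pseudo-random numbers.
rng = random.Random(1)
print(srs_smallest([rng.random() for _ in range(10)], 4))
```

Applied to the numbers in the text, the function returns units 3, 4, 5, and 10.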


EXAMPLE 2.5 The U.S. government conducts a Census of Agriculture every five years, collecting
data on all farms (defined as any place from which $1000 or more of agricultural
products were produced and sold) in the 50 states.² The Census of Agriculture provides
data on number of farms, the total acreage devoted to farms, farm size, yield of different 
crops, and a wide variety of other agricultural measures for each of the N = 3078 
counties and county-equivalents in the United States. The file agpop.dat contains the 
1982, 1987, and 1992 information on the number of farms, acreage devoted to farms, 
number of farms with fewer than 9 acres, and number of farms with more than 1000 
acres for the population. 

² The Census of Agriculture was formerly conducted by the U.S. Census Bureau; it is currently conducted
by the U.S. National Agricultural Statistics Service (NASS). More information about the census and
selected data are available on the web through the NASS material on www.fedstats.gov; also see
www.agcensus.usda.gov.

FIGURE 2.4

Histogram: number of acres devoted to farms in 1992, for an SRS of 300 counties. Note the
skewness of the data. Most of the counties have fewer than 500,000 acres in farms; some
counties, however, have more than 1.5 million acres in farms.

To take an SRS of size 300 from this population, I generated 300 random numbers
between 0 and 1 on the computer, multiplied each by 3078, and rounded the result up
to the next highest integer. This procedure generates an SRSWR. If the population is
large relative to the sample, it is likely that each unit in the sample only occurs once in
the list. In this case, however, 13 of the 300 numbers were duplicates. The duplicates
were discarded, and replaced with new randomly generated numbers between 1 and
3078 until all 300 numbers were distinct; the set of random numbers generated is in
file selectrs.dat, and the data set for the SRS is in agsrs.dat.
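A procedure of this kind can be sketched as follows. This is an added illustration under stated assumptions: it draws labels one at a time and keeps drawing until 300 distinct labels are obtained, which has the same effect as discarding and replacing duplicates, and the function name and seed are hypothetical. It will not reproduce the numbers in selectrs.dat.

```python
import math
import random

def srs_by_replacing_duplicates(N, n, rng):
    """Draw labels in 1..N with replacement until n distinct labels are obtained."""
    selected = set()
    n_draws = 0
    while len(selected) < n:
        # ceil(u * N) maps u in (0, 1] to 1..N; using 1 - random() avoids u == 0.
        label = math.ceil((1.0 - rng.random()) * N)
        selected.add(label)  # a duplicate label leaves the set unchanged
        n_draws += 1
    return sorted(selected), n_draws

sample, draws = srs_by_replacing_duplicates(N=3078, n=300, rng=random.Random(0))
print(len(sample), "distinct units;", draws - len(sample), "duplicate draws discarded")
```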

The counties selected to be in the sample may not “feel” very random at first 
glance. For example, counties 2840, 2841, and 2842 are all in the sample while none 
of the counties between 2740 and 2787 appear. The sample contains 18% of Virginia 
counties, but no counties in Alaska, Arizona, Connecticut, Delaware, Hawaii, Rhode
Island, Utah, or Wyoming. There is a quite natural temptation to want to “adjust” the 
random number list, to spread it out a bit more. If you want a random sample, you 
must resist this temptation. Research, beginning with Neyman (1934), has repeatedly 
demonstrated that purposive samples often do not represent the population on key 
variables. If you deliberately substitute other counties for those in the randomly
generated sample, you may be able to match the population on one particular characteristic
such as geographic distribution; however, you will likely fail to match the population
on characteristics of interest such as number of farms or average farm size. If you
want to ensure that all states are represented, do not adjust your randomly selected 
sample purposively but take a stratified sample (to be discussed in Chapter 3). 

Let’s look at the variable acres92, the number of acres devoted to farms in 1992. 
A small number of counties in the population are missing that value—in some cases, 
the data are withheld to prevent disclosing data on individual farms. Thus we first 
check to see the extent of the missing data in our sample. Fortunately, our sample 
has no missing data (Exercise 23 tells how likely such an occurrence is). Figure 2.4 
displays a histogram of the acreage devoted to farms in each of the 300 counties.


For estimating the population mean $\bar{y}_U$ from an SRS, we use the sample mean


$$\bar{y}_S = \frac{1}{n} \sum_{i \in S} y_i. \qquad (2.8)$$


In the following, we use $\bar{y}$ to refer to the sample mean and drop the subscript S unless
it is needed for clarity. As will be shown in Section 2.8, $\bar{y}$ is an unbiased estimator of
the population mean $\bar{y}_U$, and the variance of $\bar{y}$ is


$$V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right) \qquad (2.9)$$

for $S^2$ defined in (2.6). The variance $V(\bar{y})$ measures the variability among estimates
of $\bar{y}_U$ from different samples.

The factor (1 — n/N) is called the finite population correction (fpc). Intuitively, 
we make this correction because with small populations the greater our sampling 
fraction n/N, the more information we have about the population and thus the smaller 
the variance. If N = 10 and we sample all 10 observations, we would expect the
variance of $\bar{y}$ to be 0 (which it is). If N = 10, there is only one possible sample S of
size 10 without replacement, with $\bar{y}_S = \bar{y}_U$, so there is no variability due to taking a
sample. For a census, the fpc, and hence $V(\bar{y})$, is 0. When the sampling fraction n/N
is large in an SRS without replacement, the sample is closer to a census, which has 
no sampling variability. 

For most samples that are taken from extremely large populations, the fpc is 
approximately 1. For large populations it is the size of the sample taken, not the 
percentage of the population sampled, that determines the precision of the estimator: 
If your soup is well stirred, you need to taste only one or two spoonfuls to check the 
seasoning, whether you have made 1 liter or 20 liters of soup. A sample of size 100
from a population of 100,000 units has almost the same precision as a sample of size 
100 from a population of 100 million units: 


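$$V[\bar{y}] = \frac{S^2}{100}\left(\frac{99{,}900}{100{,}000}\right) = \frac{S^2}{100}(0.999) \quad \text{for } N = 100{,}000$$

$$V[\bar{y}] = \frac{S^2}{100}\left(\frac{99{,}999{,}900}{100{,}000{,}000}\right) = \frac{S^2}{100}(0.999999) \quad \text{for } N = 100{,}000{,}000$$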
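The effect of the fpc is easy to explore numerically. The sketch below is an added illustration; the function names are hypothetical and the population variance is set to a placeholder value of 1, since only the correction factor matters for the comparison.

```python
def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N

def var_ybar(S2, n, N):
    """V(ybar) = (S^2 / n) * (1 - n/N), equation (2.9)."""
    return S2 / n * fpc(n, N)

S2 = 1.0  # placeholder population variance; only the ratio matters here
for N in (100_000, 100_000_000):
    print(N, fpc(100, N), var_ybar(S2, 100, N))

print(var_ybar(S2, 10, 10))  # census of a population of 10: no sampling variability
```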


The population variance $S^2$, which depends on the values for the entire population,
is in general unknown. We estimate it by the sample variance:


$$s^2 = \frac{1}{n-1} \sum_{i \in S} (y_i - \bar{y})^2. \qquad (2.10)$$


An unbiased estimator of the variance of $\bar{y}$ is (see Section 2.8)


$$\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}. \qquad (2.11)$$

The standard error (SE) is the square root of the estimated variance of $\bar{y}$:

$$\text{SE}(\bar{y}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}. \qquad (2.12)$$


The population standard deviation is often related to the mean. A population of 
trees might have a mean height of 10 meters (m) and standard deviation of one m. 
A population of small cacti, however, with a mean height of 10 cm, might have a 
standard deviation of 1 cm. The coefficient of variation (CV) of the estimator $\bar{y}$ is a
measure of relative variability, which may be defined when $\bar{y}_U \neq 0$ as:


$$\text{CV}(\bar{y}) = \frac{\sqrt{V(\bar{y})}}{E(\bar{y})} = \sqrt{1 - \frac{n}{N}}\,\frac{S}{\sqrt{n}\,\bar{y}_U}. \qquad (2.13)$$


If tree height is measured in meters, then $\bar{y}_U$ and S are also in meters. The CV does
not depend on the unit of measurement: In this example, the trees and the cacti have
the same CV. We can estimate the CV of an estimator using the standard error divided
by the mean (defined only when the mean is nonzero): In an SRS,


$$\widehat{\text{CV}}(\bar{y}) = \frac{\text{SE}(\bar{y})}{\bar{y}} = \sqrt{1 - \frac{n}{N}}\,\frac{s}{\sqrt{n}\,\bar{y}}. \qquad (2.14)$$


The estimated CV is thus the standard error expressed as a percentage of the mean. 
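All these results apply to the estimation of a population total, t, since


$$t = \sum_{i=1}^{N} y_i = N\bar{y}_U.$$


To estimate t, we use the unbiased estimator


$$\hat{t} = N\bar{y}. \qquad (2.15)$$

Then, from (2.9),

$$V(\hat{t}) = N^2 V(\bar{y}) = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n} \qquad (2.16)$$

and

$$\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}. \qquad (2.17)$$


Note that $\text{CV}(\hat{t}) = \sqrt{V(\hat{t})}/E(\hat{t})$ is the same as $\text{CV}(\bar{y})$.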


For the data in Example 2.5, N = 3078 and n = 300, so the sampling fraction is
300/3078 = 0.097. The sample statistics are $\bar{y} = 297{,}897$, s = 344,551.9, and
$\hat{t} = N\bar{y} = 916{,}927{,}110$. Standard errors are


$$\text{SE}[\bar{y}] = \sqrt{\frac{s^2}{300}\left(1 - \frac{300}{3078}\right)} = 18{,}898.434428$$


and


$$\text{SE}[\hat{t}] = (3078)(18{,}898.434428) = 58{,}169{,}381,$$

and the estimated coefficient of variation is

$$\widehat{\text{CV}}[\hat{t}] = \widehat{\text{CV}}[\bar{y}] = \frac{\text{SE}[\bar{y}]}{\bar{y}} = \frac{18{,}898.434428}{297{,}897} = 0.06344.$$


Since these data are so highly skewed, we should also report the median number
of farm acres in a county, which is 196,717.
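These calculations can be reproduced from the summary statistics alone. The sketch below is an added illustration with hypothetical variable names. It uses the rounded values of $\bar{y}$ and s quoted above, so results agree with the text only up to rounding; in particular, N times the rounded mean differs slightly from the total quoted in the text.

```python
import math

N, n = 3078, 300
ybar = 297_897.0   # sample mean of acres92, as reported in the text (rounded)
s = 344_551.9      # sample standard deviation, as reported in the text

fpc = 1 - n / N
se_ybar = math.sqrt(fpc * s**2 / n)   # equation (2.12)
t_hat = N * ybar                      # equation (2.15)
se_t_hat = N * se_ybar                # square root of (2.17)
cv = se_ybar / ybar                   # equation (2.14)

print(round(se_ybar, 2), round(se_t_hat), round(cv, 5))
```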


We might also want to estimate the proportion of counties in Example 2.5 with 
fewer than 200,000 acres in farms. Since estimating a proportion is a special case 
of estimating a mean, the results in (2.8)-(2.14) hold for proportions as well, and 
they take a simple form. Suppose we want to estimate the proportion of units in the 
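population that have some characteristic; call this proportion p. Define $y_i$ to be 1 if
the unit has the characteristic and to be 0 if the unit does not have that characteristic.
Then $p = \sum_{i=1}^{N} y_i/N = \bar{y}_U$, and p is estimated by $\hat{p} = \bar{y}$. Consequently, $\hat{p}$ is an unbiased
estimator of p. For the response $y_i$, taking on values 0 or 1,


$$S^2 = \frac{\sum_{i=1}^{N} (y_i - p)^2}{N-1} = \frac{\sum_{i=1}^{N} y_i^2 - 2p\sum_{i=1}^{N} y_i + Np^2}{N-1} = \frac{N}{N-1}\,p(1-p).$$

Thus, (2.9) implies that

$$V(\hat{p}) = \left(\frac{N-n}{N-1}\right)\frac{p(1-p)}{n}. \qquad (2.18)$$

Also,

$$s^2 = \frac{1}{n-1}\sum_{i \in S} (y_i - \hat{p})^2 = \frac{n}{n-1}\,\hat{p}(1-\hat{p}),$$

so from (2.11),

$$\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}. \qquad (2.19)$$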
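Formula (2.18) can be checked exactly on the small population of Example 2.2, where $y_i$ indicates whether unit i has the value 7. The sketch below is an added illustration with hypothetical variable names. It computes the variance of $\hat{p}$ over all 70 possible samples and compares it with the formula.

```python
from fractions import Fraction
from itertools import combinations

y = [0, 0, 0, 0, 1, 1, 1, 0]   # indicator of "value is 7" for units 1..8
N, n = len(y), 4
p = Fraction(sum(y), N)        # population proportion, 3/8

# Exact sampling distribution of p_hat under an SRS (every sample has P = 1/70).
samples = list(combinations(range(N), n))
p_hats = [Fraction(sum(y[i] for i in s), n) for s in samples]
mean = sum(p_hats) / len(p_hats)
var = sum((ph - mean) ** 2 for ph in p_hats) / len(p_hats)

formula = Fraction(N - n, N - 1) * p * (1 - p) / n   # equation (2.18)
print(mean, var, formula)
```

The mean of the sampling distribution equals p, illustrating unbiasedness, and the enumerated variance matches (2.18).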


For the sample described in Example 2.5, the estimated proportion of counties with
fewer than 200,000 acres in farms is

$$\hat{p} = \frac{153}{300} = 0.51$$

with standard error

$$\text{SE}(\hat{p}) = \sqrt{\left(1 - \frac{300}{3078}\right)\frac{(0.51)(0.49)}{299}} = 0.0275.$$


Note: In an SRS, $\pi_i = n/N$ for all units i = 1,..., N. However, many other probability
sampling designs also have $\pi_i = n/N$ for all units but are not SRSs. To have an SRS,
it is not sufficient for every individual to have the same probability of being in the
sample; in addition, every possible sample of size n must have the same probability
$1/\binom{N}{n}$ of being the sample selected, as defined in (2.7). The cluster sampling design
in Example 2.1, in which the population of 100 integers is divided into 20 clusters
{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10},..., {96, 97, 98, 99, 100} and an SRS of 4 of these clusters
selected, has $\pi_i = 20/100$ for each unit in the population but is not an SRS of size
20 because different possible samples of size 20 have different probabilities of being
selected. To see this, let’s look at two particular subsets of {1,2,..., 100}. Let $S_1$ be
the cluster sample depicted in the third panel of Figure 2.1, with


$$S_1 = \{1, 2, 3, 4, 5, 46, 47, 48, 49, 50, 61, 62, 63, 64, 65, 81, 82, 83, 84, 85\},$$

and let

$$S_2 = \{1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96\}.$$

The cluster sampling design specifies taking an SRS of 4 of the 20 clusters, so
$P(S_1) = 1/\binom{20}{4} = 4!(20-4)!/20! = 1/4845$. Sample $S_2$ cannot occur under this
design, however, so $P(S_2) = 0$. An SRS with n = 20 from a population with N = 100
would have


$$P(S) = \frac{1}{\binom{100}{20}} = \frac{20!\,(100-20)!}{100!} = \frac{1}{5.359834 \times 10^{20}}$$


for every subset S of size 20 from the population {1,2,..., 100}.
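The two sample probabilities in this note can be computed with binomial coefficients. The sketch below is an added illustration with hypothetical variable names. It also confirms that both designs give every unit the same inclusion probability, 0.2, even though their sample probabilities differ.

```python
from fractions import Fraction
from math import comb

# Cluster design: SRS of 4 of the 20 clusters, so each possible cluster
# sample has probability 1 / C(20, 4).
p_cluster_sample = Fraction(1, comb(20, 4))

# SRS of 20 units from 100: every subset of size 20 has probability 1 / C(100, 20).
n_srs_samples = comb(100, 20)
p_srs_sample = Fraction(1, n_srs_samples)

# Inclusion probabilities are equal under both designs.
pi_cluster = Fraction(4, 20)    # a unit is sampled when its cluster is chosen
pi_srs = Fraction(20, 100)      # n / N

print(p_cluster_sample)         # 1/4845
print(f"{n_srs_samples:.6e}")   # number of possible SRS samples of size 20
print(pi_cluster == pi_srs)
```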


Sampling Weights 


In (2.2), we defined $\pi_i$ to be the probability that unit i is included in the sample. In
probability sampling, these inclusion probabilities are used to calculate point estimates
such as $\hat{t}$ and $\bar{y}$. Define the sampling weight, for any sampling design, to be the
reciprocal of the inclusion probability:


$$w_i = \frac{1}{\pi_i}. \qquad (2.20)$$

The sampling weight of unit i can be interpreted as the number of population units
represented by unit i.

In an SRS, each unit has inclusion probability $\pi_i = n/N$; consequently, all sampling
weights are the same with $w_i = 1/\pi_i = N/n$. We can thus think of every unit in
the sample as representing itself plus N/n − 1 of the unsampled units in the population.
Note that for an SRS,


$$\sum_{i \in S} w_i = \sum_{i \in S} \frac{N}{n} = N,$$

$$\sum_{i \in S} w_i y_i = \sum_{i \in S} \frac{N}{n}\, y_i = \hat{t},$$

and

$$\frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \frac{\hat{t}}{N} = \bar{y}.$$


All weights are the same in an SRS; that is, every unit in the sample represents
the same number of units, N/n, in the population. We call such a sample, in which
every unit has the same sampling weight, a self-weighting sample.


EXAMPLE 2.8 Let’s look at the sampling weights for the sample described in Example 2.5. Here,
N = 3078 and n = 300, so the sampling weight is $w_i = 3078/300 = 10.26$ for each
unit in the sample. The first county in the data file agsrs.dat, Coffee County, Alabama,
thus represents itself and 9.26 counties from the 2778 counties not included in the
sample. We can create a column of sampling weights as follows:


A                     B       C            D        E

County                State   acres92      weight   weight*acres92
Coffee County         AL      175,209      10.26    1,797,644.34
Colbert County        AL      138,135      10.26    1,417,265.10
Lamar County          AL      56,102       10.26    575,606.52
Marengo County        AL      199,117      10.26    2,042,940.42
Marion County         AL      89,228       10.26    915,479.28
Tuscaloosa County     AL      96,194       10.26    986,950.44
Columbia County       AR      57,253       10.26    587,415.78
⋮
Pleasants County      WV      15,650       10.26    160,569.00
Putnam County         WV      55,827       10.26    572,785.02
Sum                           89,369,114   3078     916,927,109.60


The last column is formed by multiplying columns C and D, so the entries are $w_i y_i$.
We see that $\sum_{i \in S} w_i y_i = 916{,}927{,}110$, which is the same value we obtained for the
estimated population total in Example 2.5.
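The weighted calculations in this example can be sketched in a few lines. The Python below is an added illustration; it uses three of the counties shown in the table and the sum of acres92 from the table's last row, not the full data file agsrs.dat, and the variable names are hypothetical.

```python
N, n = 3078, 300
w = N / n  # sampling weight for every unit in an SRS, 10.26

# A few of the acres92 values shown in the table (not the full sample).
acres92 = {
    "Coffee County, AL": 175_209,
    "Colbert County, AL": 138_135,
    "Lamar County, AL": 56_102,
}
for county, acres in acres92.items():
    print(county, round(w * acres, 2))  # column E: weight * acres92

sample_sum = 89_369_114        # sum of acres92 over all 300 sampled counties
sum_weights = w * n            # sum of the 300 weights, equal to N
t_hat = w * sample_sum         # sum of w_i * y_i
ybar = t_hat / sum_weights     # (sum of w_i * y_i) / (sum of w_i)
print(round(sum_weights), round(t_hat, 2), round(ybar, 2))
```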


Confidence Intervals 


When you take a sample survey, it is not sufficient to simply report the average height 
of trees or the sample proportion of voters who intend to vote for Candidate B in 
the next election. You also need to give an indication of how accurate your estimates 
are. In statistics, confidence intervals (CIs) are used to indicate the accuracy of an 
estimate. 


A 95% confidence interval is often explained heuristically: If we take samples
from our population over and over again, and construct a confidence interval using
our procedure for each possible sample, we expect 95% of the resulting intervals to
include the true value of the population parameter.

In probability sampling from a finite population, only a finite number of possible
samples exist and we know the probability with which each will be chosen; if we
were able to generate all possible samples from the population, we would be able to
calculate the exact confidence level for a confidence interval procedure.


EXAMPLE 2.9 Return to Example 2.2, in which the entire population is known. Let’s choose an
arbitrary procedure for calculating a confidence interval, constructing interval estimates
for t as


$$\text{CI}(S) = [\hat{t}_S - 4s_S,\ \hat{t}_S + 4s_S].$$


There is no theoretical reason to choose this procedure, but it will illustrate the concept
of a confidence interval. Define u(S) to be 1 if CI(S) contains the true population value
40, and 0 if CI(S) does not contain 40. Since we know the population, we can calculate 
the confidence interval CI(S) and the value of u(S) for each possible sample S. Some 
of the 70 confidence intervals are shown in Table 2.1 (all entries in the table are 
rounded to two decimals). 

Each individual confidence interval either does or does not contain the population 
total 40. The probability statement in the confidence interval is made about the col- 
lection of all possible samples; for this confidence interval procedure and population, 
the confidence level is 


$$\sum_{S} P(S)\,u(S) = 0.77.$$
That means that if we take an SRS of four elements without replacement from this 
population of eight elements, there is a 77% chance that our sample is one of the 
“good” ones whose confidence interval contains the true value 40. This procedure 
thus creates a 77% confidence interval. 

Of course, in real life, we only take one sample and do not know the value of the 
population total t. Without further investigation, we have no way of knowing whether
the sample we obtained is one of the “good” ones, such as S = {2,3,5,6}, or one
of the “bad” ones such as S = {4,6,7,8}. The confidence interval gives us only a
probabilistic statement of how often we expect to be right.
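The exact confidence level of this procedure can be recomputed by enumerating all 70 samples. The sketch below is an added illustration with hypothetical variable names; it takes $s_S$ to be the sample standard deviation with divisor n − 1, which agrees with the values shown in Table 2.1.

```python
from itertools import combinations
from statistics import stdev

y = {1: 1, 2: 2, 3: 4, 4: 4, 5: 7, 6: 7, 7: 7, 8: 8}  # y_i for units 1..8
N, n = len(y), 4
t = sum(y.values())  # true population total, 40

samples = list(combinations(sorted(y), n))
covered = 0
for s in samples:
    ys = [y[i] for i in s]
    t_hat = N * sum(ys) / n
    s_S = stdev(ys)                      # sample standard deviation
    lower, upper = t_hat - 4 * s_S, t_hat + 4 * s_S
    covered += lower <= t <= upper       # u(S) = 1 if CI(S) contains t

level = covered / len(samples)           # sum over S of P(S) * u(S), with P(S) = 1/70
print(covered, round(level, 2))
```

Of the 70 intervals, 54 contain the true total, which gives the confidence level of 0.77 reported above.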


In practice, we do not know the values of statistics from all possible samples, 
so we cannot calculate the exact confidence coefficient for a procedure as done in 
Example 2.9. In your introductory statistics class, you relied largely on asymptotic 
(as the sample size goes to infinity) results to construct confidence intervals for an 
unknown mean $\mu$. The central limit theorem says that if we have a random sample
with replacement, then the probability distribution of $\sqrt{n}(\bar{y} - \mu)$ converges to a normal
distribution as the sample size n approaches infinity.

In most sample surveys, though, we only have a finite population. To use asymptotic
results in finite population sampling, we pretend that our population is itself
part of a larger superpopulation; the superpopulation is itself a subset of a larger
superpopulation, and so on until the superpopulations are as large as we could wish.
Our population is embedded in a series of increasing finite populations. This embedding
can give us properties such as consistency and asymptotic normality. One can
imagine the superpopulations as “alternative universes” in a science fiction sense:
what might have happened if circumstances were slightly different.


TABLE 2.1
Confidence intervals for possible samples from small population


Sample S        $y_i$, $i \in S$   $\hat{t}_S$   $s_S$   CI(S)             u(S)
{1, 2, 3, 4}    1, 2, 4, 4         22            1.50    [16.00, 28.00]    0
{1, 2, 3, 5}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 3, 6}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 3, 7}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 3, 8}    1, 2, 4, 8         30            3.10    [17.62, 42.38]    1
{1, 2, 4, 5}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 4, 6}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 4, 7}    1, 2, 4, 7         28            2.65    [17.42, 38.58]    0
{1, 2, 4, 8}    1, 2, 4, 8         30            3.10    [17.62, 42.38]    1
{1, 2, 5, 6}    1, 2, 7, 7         34            3.20    [21.19, 46.81]    1
⋮
{2, 3, 4, 8}    2, 4, 4, 8         36            2.52    [25.93, 46.07]    1
{2, 3, 5, 6}    2, 4, 7, 7         40            2.45    [30.20, 49.80]    1
{2, 3, 5, 7}    2, 4, 7, 7         40            2.45    [30.20, 49.80]    1
{2, 3, 5, 8}    2, 4, 7, 8         42            2.75    [30.98, 53.02]    1
{2, 3, 6, 7}    2, 4, 7, 7         40            2.45    [30.20, 49.80]    1
⋮
{4, 5, 6, 8}    4, 7, 7, 8         52            1.73    [45.07, 58.93]    0
{4, 5, 7, 8}    4, 7, 7, 8         52            1.73    [45.07, 58.93]    0
{4, 6, 7, 8}    4, 7, 7, 8         52            1.73    [45.07, 58.93]    0
{5, 6, 7, 8}    7, 7, 7, 8         58            0.50    [56.00, 60.00]    0

Hájek (1960) proves a central limit theorem for simple random sampling without 
replacement (also see Lehmann, 1999, Sections 2.8 and 4.4, for a derivation). In 
practical terms, Hájek’s theorem says that if certain technical conditions hold and if 
n, N, and N − n are all “sufficiently large,” then the sampling distribution of 

$$\frac{\bar{y} - \bar{y}_U}{\sqrt{1 - \dfrac{n}{N}}\,\dfrac{S}{\sqrt{n}}}$$

is approximately normal (Gaussian) with mean 0 and variance 1. A large-sample 
100(1 − α)% CI for the population mean is 

$$\left[\bar{y} - z_{\alpha/2}\sqrt{1 - \frac{n}{N}}\,\frac{S}{\sqrt{n}},\ \ \bar{y} + z_{\alpha/2}\sqrt{1 - \frac{n}{N}}\,\frac{S}{\sqrt{n}}\right], \tag{2.21}$$

where $z_{\alpha/2}$ is the $(1 - \alpha/2)$th percentile of the standard normal distribution. In simple 
random sampling without replacement, 95% of the possible samples that could be 
chosen will give a 95% CI for $\bar{y}_U$ that contains the true value of $\bar{y}_U$. Usually, S is 
unknown, so in large samples s is substituted for S with little change in the approximation; 
the large-sample CI is 

$$[\bar{y} - z_{\alpha/2}\,\mathrm{SE}(\bar{y}),\ \ \bar{y} + z_{\alpha/2}\,\mathrm{SE}(\bar{y})].$$


In practice, we often substitute $t_{\alpha/2,n-1}$, the $(1 - \alpha/2)$th percentile of a t distribution 
with n − 1 degrees of freedom, for $z_{\alpha/2}$. For large samples, $t_{\alpha/2,n-1} \approx z_{\alpha/2}$. In smaller 
samples, using $t_{\alpha/2,n-1}$ instead of $z_{\alpha/2}$ produces a wider CI. Most software packages 
use the following CI for the population mean from an SRS: 

$$\left[\bar{y} - t_{\alpha/2,n-1}\sqrt{1 - \frac{n}{N}}\,\frac{s}{\sqrt{n}},\ \ \bar{y} + t_{\alpha/2,n-1}\sqrt{1 - \frac{n}{N}}\,\frac{s}{\sqrt{n}}\right]. \tag{2.22}$$


The imprecise term “sufficiently large” occurs in the central limit theorem because 
the adequacy of the normal approximation depends on n and on how closely the 
population $\{y_i, i = 1, \ldots, N\}$ resembles a population generated from the normal 
distribution. The “magic number” of n = 30, often cited in introductory statistics books 
as a sample size that is “sufficiently large” for the central limit theorem to apply, 
often does not suffice in finite population sampling problems. Many populations we 
sample are highly skewed: we may measure income, number of acres on a farm that 
are devoted to corn, or the concentration of mercury in Minnesota lakes. For all of 
these examples, we expect most of the observations to be relatively small, but a few 
to be very, very large, so that a smoothed histogram of the entire population would 
be strongly skewed to the right, with a long upper tail. 


Thinking of observations as generated from some distribution is useful in deciding 
whether it is safe to use the central limit theorem. If you can think of the generating 
distribution as being somewhat close to normal, it is probably safe to use the central 
limit theorem with a sample size as small as 50. If the sample size is too small and the 
sampling distribution of $\bar{y}$ is not approximately normal, we need to use another method, 
relying on distributional assumptions, to obtain a CI for $\bar{y}_U$. Such methods rely on a 
model-based perspective for sampling (Section 2.9). 

Sugden et al. (2000) discuss and extend Cochran’s rule (Cochran, 1977, p. 42) for 
the sample size needed for the normal approximation to be adequate. They recommend 
a minimum sample size of 

$$n_{\min} = 28 + 25\left[\frac{\sum_{i=1}^{N}(y_i - \bar{y}_U)^3}{N S^3}\right]^2 \tag{2.23}$$

for the CI in (2.22) to have confidence level approximately equal to 1 − α. The 
quantity $\sum_{i=1}^{N}(y_i - \bar{y}_U)^3/(N S^3)$ is the skewness of the population; if the skewness 
is large, a large sample size is needed for the normal approximation to be valid. 
Another approach for considering whether the sample size is adequate for a normal 
approximation to be used is to look at a bootstrap approximation to the sampling 
distribution (see Exercise 25). 


EXAMPLE 2.10 The histogram in Figure 2.4 exhibited an underlying distribution for farm acreage that 
was far from normal. Is the sample size large enough to apply the central limit theorem? 
We substitute the sample values s = 344,551.9 and $\sum_{i \in S}(y_i - \bar{y})^3/n = 1.05036 \times 10^{17}$ 
for the population quantities S and $\sum_{i=1}^{N}(y_i - \bar{y}_U)^3/N$ in (2.23), giving an estimated 
minimum sample size of 

$$28 + 25\left[\frac{1.05036 \times 10^{17}}{(344{,}551.9)^3}\right]^2 \approx 193.$$

For this example, our sample of size 300 appears to be sufficiently large for the 
sampling distribution of $\bar{y}$ to be approximately normal. 
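The arithmetic in (2.23) is easy to reproduce. The following sketch plugs in the two sample quantities quoted in the example; the function name `sugden_min_n` is an illustrative choice, not something from the book or from SAS.

```python
import math

def sugden_min_n(third_central_moment, sd):
    """Minimum sample size from (2.23): 28 + 25 * skewness**2."""
    skewness = third_central_moment / sd ** 3
    return 28 + 25 * skewness ** 2

# Sample substitutes for the population quantities, as quoted in the example.
n_min = sugden_min_n(1.05036e17, 344_551.9)
print(math.ceil(n_min))
```

Because the skewness enters squared, doubling it roughly quadruples the second term, which is why heavily skewed populations can need samples far larger than 30.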

For the data in Example 2.5, an approximate 95% CI for $\bar{y}_U$, using $t_{\alpha/2,299} = 1.968$, is 

[297,897 − (1.968)(18,898.434), 297,897 + (1.968)(18,898.434)] 
= [260,706, 335,088]. 

For the population total t, an approximate 95% CI is 

[916,927,110 − 1.968(58,169,381), 916,927,110 + 1.968(58,169,381)] 
= [8.02 × 10^8, 1.03 × 10^9]. 
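As a rough check on these intervals, the sketch below evaluates (2.22) from the summary statistics quoted in the text. The critical value is passed in by hand (1.968, as above) rather than computed, and `srs_mean_ci` is a hypothetical helper name.

```python
import math

def srs_mean_ci(ybar, s, n, N, crit):
    """SE and CI (2.22) for the population mean from an SRS without replacement.

    crit is the critical value (a t or z quantile) supplied by the caller.
    """
    se = math.sqrt(1 - n / N) * s / math.sqrt(n)
    return se, (ybar - crit * se, ybar + crit * se)

# Values from the example: n = 300 counties sampled from N = 3078.
N = 3078
se_mean, ci_mean = srs_mean_ci(297_897, 344_551.9, 300, N, crit=1.968)

# The total is N times the mean, so its SE and CI endpoints scale by N.
se_total = N * se_mean
ci_total = (N * ci_mean[0], N * ci_mean[1])
print(round(se_mean, 1), [round(x) for x in ci_mean])
```

Small differences in the last digit relative to the printed interval come from rounding the critical value to 1.968.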


For estimating proportions, the usual criterion that the sample size is large enough 
to use the normal distribution if both $np \ge 5$ and $n(1 - p) \ge 5$ is a useful guideline. 
A 95% CI for the proportion of counties with fewer than 200,000 acres in farms is 

0.51 ± 1.968(0.0275), or [0.456, 0.564]. 

To find a 95% CI for the total number of counties with fewer than 200,000 acres 
in farms, we only need to multiply all quantities by N, so the point estimate is 
3078(0.51) = 1570, with standard error 3078 × SE($\hat{p}$) = 84.65 and 95% CI [1403, 
1736]. 

Software packages such as SAS that calculate estimates for survey samples use 
the weight variable to find point estimates of means and totals. Here is output from 
SAS PROC SURVEYMEANS. The variable acres92 is the number of acres devoted 
to farms in 1992, and the variable lt200k takes on the value 1 if the county has less 
than 200,000 acres in farms and takes on the value 0 if the county has greater than 
200,000 acres in farms. The SAS code used to produce this output is given on the 
website in file example0210.sas. 


                       The SURVEYMEANS Procedure

                             Data Summary

                  Number of Observations          300
                  Sum of Weights                 3078

                       Class Level Information

                  Class
                  Variable     Levels    Values
                  lt200k            2    0 1

                              Statistics

                             Std Error      Lower 95%      Upper 95%
    Variable         Mean      of Mean    CL for Mean    CL for Mean            Sum
    acres92        297897        18898         260706         335088      916927110
    lt200k=0     0.490000     0.027465       0.435951       0.544049    1508.220000
    lt200k=1     0.510000     0.027465       0.455951       0.564049    1569.780000

                              Statistics

                                 Lower 95%      Upper 95%
    Variable        Std Dev     CL for Sum     CL for Sum
    acres92        58169381      802453859     1031400361
    lt200k=0      84.537220    1341.856696    1674.583304
    lt200k=1      84.537220    1403.416696    1736.143304


The weight for every observation in this sample is $w_i = 3078/300$; note that the sum 
of the weights is 3078 (= N). 
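The lt200k=1 row of the output can be approximated by hand. This sketch assumes the standard error of a proportion under SRS uses the fpc and an n − 1 divisor, an assumption made here because it matches the printed 0.027465; all variable names are illustrative.

```python
import math

N, n = 3078, 300
w = N / n                      # every sampled county carries the same weight
p_hat = 0.51                   # sample proportion with lt200k = 1

# The weighted count of sampled counties with lt200k = 1 estimates the total.
n_ones = round(p_hat * n)      # 153 sampled counties
t_hat = w * n_ones

# Assumed SE formula: sqrt((1 - n/N) * p(1 - p) / (n - 1)).
se_p = math.sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
se_total = N * se_p            # the text's 84.65 uses SE rounded to 0.0275
print(round(t_hat, 2), round(se_p, 6), round(se_total, 2))
```

The weighted sum and the "N times the proportion" calculation agree because all weights are equal in an SRS.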


Sample Size Estimation 


An investigator often measures several variables and has a number of goals for a 
survey. Anyone designing an SRS must decide what amount of sampling error in the 
estimates is tolerable and must balance the precision of the estimates with the cost of 
the survey. Even though many variables may be measured, an investigator can often 
focus on one or two responses that are of primary interest in the survey, and use these 
for estimating a sample size. 

For a single response, follow these steps to estimate the sample size: 


1 Ask “What is expected of the sample, and how much precision do I need?” What 
are the consequences of the sample results? How much error is tolerable? If your 
survey measures the unemployment rate every month, you would like your esti- 
mates to be very precise indeed so that you can detect changes in unemployment 
rates from month to month. A preliminary investigation, however, often needs less 
precision than an ongoing survey. 


Instead of asking about required precision, many people ask, “What percentage of 
the population should I include in my sample?” This is usually the wrong question 
to be asking. Except in very small populations, precision is obtained through the 
absolute size of the sample, not the proportion of the population covered. We saw 
in Section 2.3 that the fpc, which is the only place that the population size N occurs 
in the variance formula, has little effect on the variance of the estimator in large 
populations. 


2 Find an equation relating the sample size n and your expectations of the sample. 
3 Estimate any unknown quantities and solve for n. 


4 If you are relatively new at designing surveys, you will find at this point that the 
sample size you calculated in step 3 is much larger than you can afford. Go back 
and adjust some of your expectations for the survey and try again. In some cases, 
you will find that you cannot even come close to the precision you need with the 
resources you have available; in that case, perhaps you should consider whether 
you should even conduct your study. 


Specify the Tolerable Error. Only the investigators in the study can say how much 
precision is needed. The desired precision is often expressed in absolute 
terms, as 

$$P(|\bar{y} - \bar{y}_U| \le e) = 1 - \alpha.$$

The investigator must decide on reasonable values for α and e; e is called the margin of 
error in many surveys. For many surveys of people in which a proportion is measured, 
e = 0.03 and α = 0.05. 

Sometimes you would like to achieve a desired relative precision, controlling 
the CV in (2.13) rather than the absolute error. In that case, if $\bar{y}_U \neq 0$ the precision 
may be expressed as 

$$P\left(\left|\frac{\bar{y} - \bar{y}_U}{\bar{y}_U}\right| \le r\right) = 1 - \alpha.$$

Find an Equation. The simplest equation relating the precision and sample size comes 
from the confidence intervals in the previous section. To obtain absolute precision e, 
find a value of n that satisfies 

$$e = z_{\alpha/2}\sqrt{1 - \frac{n}{N}}\,\frac{S}{\sqrt{n}}.$$

To solve this equation for n, we first find the sample size $n_0$ that we would use for an 
SRSWR: 

$$n_0 = \left(\frac{z_{\alpha/2}\,S}{e}\right)^2. \tag{2.24}$$

Then (see Exercise 9) the desired sample size is 

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}}. \tag{2.25}$$

Of course, if $n_0 > N$ we simply take a census with n = N. 

In surveys in which one of the main responses of interest is a proportion, it is often 
easiest to use that response in setting the sample size. For large populations, 
$S^2 \approx p(1 - p)$, which attains its maximal value when p = 1/2. So using $n_0 = 1.96^2/(4e^2)$ 
will result in a 95% CI with width at most 2e. 

To calculate a sample size to obtain a specified relative precision, substitute $r\bar{y}_U$ 
for e in (2.24) and (2.25). This results in sample size 

$$n = \frac{z_{\alpha/2}^2 S^2}{(r\bar{y}_U)^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}}. \tag{2.26}$$

To achieve a specified relative precision, the sample size may be determined using 
only the ratio $S/\bar{y}_U$, the CV for a sample of size 1. 
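One way to see why only the ratio matters is to substitute $r\bar{y}_U$ for e in (2.25) and divide the numerator and denominator by $\bar{y}_U^2$. The short derivation below is a sketch of that step, not a quotation from the text:

$$n = \frac{z_{\alpha/2}^2 S^2}{(r\bar{y}_U)^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} = \frac{z_{\alpha/2}^2\,(S/\bar{y}_U)^2}{r^2 + \dfrac{z_{\alpha/2}^2\,(S/\bar{y}_U)^2}{N}}.$$

The right-hand side involves S and $\bar{y}_U$ only through $S/\bar{y}_U$, so a guess at the CV is enough to set the sample size.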


EXAMPLE 2.11 Suppose we want to estimate the proportion of recipes in the Better Homes & Gardens 
New Cook Book that do not involve animal products. We plan to take an SRS of the 
N = 1251 test kitchen-tested recipes, and want to use a 95% CI with margin of error 
0.03. Then, 

$$n_0 \approx \frac{(1.96)^2\left(\tfrac{1}{2}\right)\left(1 - \tfrac{1}{2}\right)}{(0.03)^2} \approx 1067.$$

The sample size ignoring the fpc is large compared with the population size, so in 
this case we would make the fpc adjustment and use 

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{1067}{1 + \dfrac{1067}{1251}} = 576.$$

In Example 2.11, the fpc makes a difference in the sample size because N is 
only 1251. If N is large, however, typically $n_0/N$ will be very small so that for large 
populations we usually have $n \approx n_0$. Thus, we need approximately the same sample 
size for any large population, whether that population has 10 million or 1 billion or 
100 billion units. 


EXAMPLE 2.12 Many public opinion polls specify using a sample size of about 1100. That number 
comes from rounding the value of $n_0$ in Example 2.11 up to the next hundred, and then 
noting that the population size is so large relative to the sample that the fpc should be 
ignored. For large populations, it is the size of the sample, not the proportion of the 
population that is sampled, that determines the precision. 
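The two-step calculation in (2.24) and (2.25) can be sketched as a small function. The name `srs_sample_size` and its argument names are illustrative assumptions; the normal quantile comes from Python's standard library rather than from a table.

```python
import math
from statistics import NormalDist

def srs_sample_size(e, S, N=None, alpha=0.05):
    """Return (n0, n): (2.24) ignoring the fpc and (2.25) with it.

    If N is None, the population is treated as effectively infinite.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = (z * S / e) ** 2
    n = n0 if N is None else n0 / (1 + n0 / N)
    return n0, n

# Proportion with the conservative bound S^2 <= 1/4 and margin of error 0.03.
n0, n_recipes = srs_sample_size(e=0.03, S=0.5, N=1251)
_, n_big = srs_sample_size(e=0.03, S=0.5, N=10_000_000)
print(round(n0), math.ceil(n_recipes), math.ceil(n_big))
```

With N = 1251 the fpc cuts the requirement roughly in half, whereas with ten million units the adjusted and unadjusted sizes are practically the same.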


Estimate Unknown Quantities. When interested in a proportion, we can use 1/4 as an 
upper bound for $S^2$. For other quantities, $S^2$ must be estimated or guessed at. Some 
methods for estimating $S^2$ include: 


1 Use sample quantities obtained when pretesting your survey. This is probably the 
best method, as your pretest should be similar to the survey you take. A pilot 
sample, a small sample taken to provide information and guidance for the design 
of the main survey, can be used to estimate quantities needed for setting the sample 
size. 


2 Use previous studies or data available in the literature. You are rarely the first person 
in the world to study anything related to your investigation. You may be able to find 
estimates of variances that have been published in related studies, and use these as 
a starting point for estimating your sample size. But you have no control over the 
quality or design of those studies, and their estimates may be unreliable or may 
not apply for your study. In addition, estimates may change over time and vary in 
different geographic locations. 

Sometimes you can use the CV for a sample of size 1, the ratio of the standard 
deviation to the mean, in obtaining estimates of variability. The CV of a quantity is 
a measure of relative error, and tends to be more stable over time and location than 
the variance. If we take a random sample of houses for sale in the United States 
today, we will find that the variability in price will be much greater than if we had 
taken a similar survey in 1930. But the average price of a house has also increased 
from 1930 to today. We would probably find that the CV today is close to the CV 
in 1930. 


3 If nothing else is available, guess the variance. Sometimes a hypothesized 
distribution of the data will give us information about the variance. For example, if you 
believe the population to be normally distributed, you may not know what the 
variance is, but you may have an idea of the range of the data. You could then estimate 
S by range/4 or range/6, as approximately 95% of values from a normal population 
are within 2 standard deviations of the mean, and 99.7% of the values are within 3 
standard deviations of the mean. 


FIGURE 2.5 
Plot of $t_{0.025,n-1}\,s/\sqrt{n}$ vs. n, for two possible values of the standard deviation s. 


EXAMPLE 2.13 Before taking the sample of size 300 in Example 2.5, we took a pilot sample of size 
30 from the population. One county in the pilot sample of size 30 was missing the 
value of acres92; the sample standard deviation of the remaining 29 observations was 
519,085. Using this value, and a desired margin of error of 60,000, 

$$n_0 = (1.96)^2\,\frac{(519{,}085)^2}{(60{,}000)^2} \approx 288.$$

We took a sample of size 300 in case the estimated standard deviation from the pilot 
sample is too low. Also, we ignored the fpc in the sample size calculations; in most 
populations, the fpc will have little effect on the sample size. 

You may also view possible consequences of different sample sizes graphically. 
Figure 2.5 shows the value of $t_{0.025,n-1}\,s/\sqrt{n}$ for a range of sample sizes between 50 
and 700, and for two possible values of the standard deviation s. The plot shows that 
if we ignore the fpc and if the standard deviation is about 500,000, a sample of size 
300 will give a margin of error of about 60,000. 
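The same information can be tabulated rather than plotted. The sketch below uses the normal quantile as a stand-in for $t_{0.025,n-1}$, which is an approximation, and the two standard deviations (500,000 and 700,000) are illustrative guesses, not necessarily the values behind Figure 2.5.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # normal stand-in for t_{0.025, n-1}

# Sample size from the pilot standard deviation, ignoring the fpc.
s_pilot, e = 519_085, 60_000
n0 = (z * s_pilot / e) ** 2

def margin_of_error(n, s):
    """Approximate half-width z * s / sqrt(n), ignoring the fpc."""
    return z * s / math.sqrt(n)

for n in (50, 100, 300, 700):
    # Two hypothetical values of s, for comparison.
    print(n, round(margin_of_error(n, 500_000)), round(margin_of_error(n, 700_000)))
```

The margin of error shrinks only with the square root of n, so halving it requires roughly four times the sample size.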


Determining the sample size is one of the early steps that must be taken in an 
investigation, and no magic formula will tell you the perfect sample size for your 
investigation (you only know that in hindsight, after you have completed the study!). 
Choosing a sample size is somewhat like deciding how much food to take on a picnic. 
You have a rough idea of how many people will attend, but do not know how much 
food you should have brought until after the picnic is over. You also need to bring 
extra food to allow for unexpected happenings, such as 2-year-old Freddie feeding a 
bowl of potato salad to the ducks or cousin Ted bringing along some extra guests. But 
you do not want to bring too much extra food, or it will spoil and you will have wasted 
money. Of course, the more picnics you have organized, and the better acquainted you 
are with the picnic guests, the better you become at bringing the right amount of food. 
It is comforting to know that the same is true of determining sample sizes: experience 
and knowledge about the population make you much better at designing surveys. 

The results in this section can give you some guidance in choosing the size of 
the sample, but the final decision is up to you. In general, the larger the sample, the 
smaller the sampling error. Remember, though, that in most surveys you also need 
to worry about nonsampling errors, and need to budget resources to control selection 
and measurement bias. In many cases, nonsampling errors are greater when a larger 
sample is taken—with a large sample, it is easy to introduce additional sources of 
error (for example, it becomes more difficult to control the quality of the interviewers 
or to follow up on nonrespondents) or to become more relaxed about selection bias. In 
Chapter 15, we shall revisit the issue of designing a sample with the aim of reducing 
nonsampling as well as sampling error. 


Systematic Sampling 


Sometimes systematic sampling is used as a proxy for simple random sampling, 
when no list of the population exists or when the list is in roughly random order. To 
obtain a systematic sample, choose a sample size n. If N/n is an integer, let k= N/n; 
otherwise, let k be the next integer after N/n. Then find a random integer R between 
1 and k, which determines the sample to be the units numbered R,R + k,R + 2k, 
etc. For example, to select a systematic sample of 45 students from the list of 45,000 
students at a university, the sampling interval k is 1000. Suppose the random integer 
we choose is 597. Then the students numbered 597, 1597, 2597, ... , 44,597 would 
be in the sample. 
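The selection rule just described can be sketched in a few lines. The function name `systematic_sample` and its `start` argument are illustrative; passing `start` fixes the random integer R so that the example is reproducible.

```python
import math
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return unit labels R, R + k, R + 2k, ... that do not exceed N.

    k is N/n rounded up to an integer; start plays the role of R in 1..k.
    """
    k = math.ceil(N / n)
    R = rng.randint(1, k) if start is None else start
    if not 1 <= R <= k:
        raise ValueError("start must lie between 1 and k")
    return list(range(R, N + 1, k))

sample = systematic_sample(45_000, 45, start=597)
print(len(sample), sample[:3], sample[-1])
```

When N/n is not an integer, this rule can return one unit fewer than n for some starting points, which is one reason the realized sample size should be checked.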

If the list of students is ordered by randomly generated student identification 
numbers, we shall probably obtain a sample that will behave much like an SRS—it 
is unlikely that a person’s position in the list is associated with the characteristic of 
interest. However, systematic sampling is not the same as simple random sampling; it 
does not have the property that every possible group of n units has the same probability 
of being the sample. In the example above, it is impossible to have students 345 and 
346 both appear in the sample. Systematic sampling is technically a form of cluster 
sampling, as will be discussed in Chapter 5. 

If the population is in random order, the systematic sample will behave much like 
an SRS, and SRS methods can be used in the analysis. The population itself can be 
thought of as being mixed. In the quote at the beginning of this chapter, Sorensen 
reports that President Kennedy used to read a systematic sample of letters written 
to him at the White House. This systematic sample most likely behaved much like 
a random sample. Note that Kennedy was well aware that the letters he read, while 
representative of letters written to the White House, were not at all representative of 
public opinion. 

Systematic sampling does not necessarily give a representative sample, though, 
if the listing of population units is in some periodic or cyclical order. If male and 
female names alternate in the list, for example, and k is even, the systematic sample 
will contain either all men or all women—this cannot be considered a representative 
sample. In ecological surveys done on agricultural land, a ridge-and-furrow topogra- 
phy may be present that would lead to a periodic pattern of vegetation. If a systematic 
sampling scheme follows the same cycle, the sample will not behave like an SRS. 
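The alternating-names case is easy to reproduce on a made-up list. In this sketch the population of 100 alternating "M" and "F" entries is hypothetical, and `systematic_units` is an illustrative helper.

```python
# Hypothetical list in which male and female names alternate.
population = ["M" if i % 2 == 1 else "F" for i in range(1, 101)]

def systematic_units(values, k, R):
    """Values at positions R, R + k, R + 2k, ... (positions start at 1)."""
    return values[R - 1::k]

# With an even interval, every starting point yields a single-sex sample.
even_k = {R: set(systematic_units(population, 10, R)) for R in range(1, 11)}

# With an odd interval, the sample alternates and contains both sexes.
odd_k = set(systematic_units(population, 9, 1))
print(even_k[1], even_k[2], len(odd_k))
```

The problem is not the interval itself but its alignment with the period of the list.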

On the other hand, some populations are in increasing or decreasing order. A list 
of accounts receivable may be ordered from largest amount to smallest amount. In 
this case, estimates from the systematic sample may have smaller (but unestimable) 
variance than comparable estimates from the SRS. A systematic sample from an 
ordered list of accounts receivable is forced to contain some large amounts and some 
small amounts. It is possible for an SRS to contain all small amounts or all large 
amounts, so there may be more variability among the sample means of all possible 
SRSs than there is among the sample means of all possible systematic samples. 
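A small exact calculation illustrates the point. The sketch assumes a hypothetical ordered list of amounts 1, 2, ..., 100, compares the variance of the sample mean over the 10 possible systematic samples of size 10 with the SRS variance $(1 - n/N)S^2/n$, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# Hypothetical ordered list: amounts 1, 2, ..., 100 in increasing order.
y = list(range(1, 101))
N, n = len(y), 10
k = N // n

def mean(values):
    """Exact mean of a list of integers or Fractions."""
    return Fraction(sum(values)) / len(values)

# Means of the k possible systematic samples, one per starting point.
sys_means = [mean(y[R::k]) for R in range(k)]
grand = mean(sys_means)
var_systematic = sum((m - grand) ** 2 for m in sys_means) / k

# Variance of the SRS sample mean: (1 - n/N) * S^2 / n.
ybar_U = mean(y)
S2 = sum((v - ybar_U) ** 2 for v in y) / (N - 1)
var_srs = (1 - Fraction(n, N)) * S2 / n
print(float(var_systematic), float(var_srs))
```

For this list the systematic design is far more precise, but a single systematic sample gives no way to estimate that smaller variance from the data alone.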

In systematic sampling, we must still have a sampling frame and be careful when 
defining the target population. Sampling every 20th student to enter the library will 
not give a representative sample of the student body. Sampling every 10th person 
exiting an airplane, though, will probably give a representative sample of the persons 
on that flight. The sampling frame for the airplane passengers is not written down, 
but it exists all the same. 


Randomization Theory Results for Simple Random Sampling*³ 


In this section, we show that $\bar{y}$ is an unbiased estimator of $\bar{y}_U$. We also calculate the 
variance of $\bar{y}$ given in (2.9), and show that the estimator in (2.11) is unbiased over 
repeated sampling. 

No distributional assumptions are made about the $y_i$’s in order to ascertain that $\bar{y}$ is 
unbiased for estimating $\bar{y}_U$. We do not, for instance, assume that the $y_i$’s are normally 
distributed with mean $\mu$. In the randomization theory (also called design-based) 
approach to sampling, the $y_i$’s are considered to be fixed but unknown numbers; the 
random variables used in randomization theory inference indicate which population 
units are in the sample. 

Let’s see how the randomization theory works for deriving properties of the sample 
mean in simple random sampling. As in Cornfield (1944), define 

$$Z_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample} \\ 0 & \text{otherwise.} \end{cases} \tag{2.27}$$

³An asterisk (*) indicates a section, chapter, or exercise that requires more mathematical background. 

Then 

$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i\,\frac{y_i}{n}. \tag{2.28}$$

The $Z_i$’s are the only random variables in (2.28) because, according to randomization 
theory, the $y_i$’s are fixed quantities. When we choose an SRS of n units out of the 
N units in the population, $\{Z_1, \ldots, Z_N\}$ are identically distributed Bernoulli random 
variables with 

$$\pi_i = P(Z_i = 1) = P(\text{select unit } i \text{ in sample}) = \frac{n}{N} \tag{2.29}$$

and 

$$P(Z_i = 0) = 1 - \pi_i = 1 - \frac{n}{N}.$$

The probability in (2.29) follows from the definition of an SRS. To see this, note that 
if unit i is in the sample, then the other n − 1 units in the sample must be chosen from 
the other N − 1 units in the population. A total of $\binom{N-1}{n-1}$ possible samples of size 
n − 1 may be drawn from a population of size N − 1, so 

$$P(Z_i = 1) = \frac{\text{number of samples including unit } i}{\text{number of possible samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$

As a consequence of (2.29), 

$$E[Z_i] = E[Z_i^2] = \frac{n}{N}$$

and 

$$E[\bar{y}] = E\left[\sum_{i=1}^{N} Z_i\,\frac{y_i}{n}\right] = \sum_{i=1}^{N} E[Z_i]\,\frac{y_i}{n} = \sum_{i=1}^{N} \frac{n}{N}\,\frac{y_i}{n} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U. \tag{2.30}$$

This shows that $\bar{y}$ is an unbiased estimator of $\bar{y}_U$. Note that in (2.30), the random 
variables are $Z_1, \ldots, Z_N$; $y_1, \ldots, y_N$ are treated as constants. 
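Because the randomization distribution is just the list of all possible samples, (2.29) and (2.30) can be verified exactly on a toy population. The six values below are hypothetical and chosen arbitrarily; any fixed numbers would serve.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical fixed population values y_1, ..., y_N.
y = [3, 8, 1, 12, 7, 5]
N, n = len(y), 3
samples = list(combinations(range(N), n))      # equally likely under SRS

# Inclusion probability of each unit: the share of samples that contain it.
pi = [Fraction(sum(i in S for S in samples), len(samples)) for i in range(N)]

# Expected value of the sample mean over the randomization distribution.
expected_ybar = sum(Fraction(sum(y[i] for i in S), n) for S in samples) / len(samples)
pop_mean = Fraction(sum(y), N)
print(pi[0], expected_ybar, pop_mean)
```

Nothing here depends on the particular y values, which mirrors the point that only the $Z_i$ are random.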

The variance of $\bar{y}$ is also calculated using properties of the random variables 
$Z_1, \ldots, Z_N$. Note that 

$$V(Z_i) = E[Z_i^2] - (E[Z_i])^2 = \frac{n}{N} - \left(\frac{n}{N}\right)^2 = \frac{n}{N}\left(1 - \frac{n}{N}\right).$$

For $i \neq j$, 

$$E[Z_i Z_j] = P(Z_i = 1 \text{ and } Z_j = 1) = P(Z_j = 1 \mid Z_i = 1)\,P(Z_i = 1) = \left(\frac{n-1}{N-1}\right)\left(\frac{n}{N}\right).$$

Because the population is finite, the $Z_i$’s are not quite independent: if we know that 
unit i is in the sample, we do have a small amount of information about whether unit j is 
in the sample, reflected in the conditional probability $P(Z_j = 1 \mid Z_i = 1)$. Consequently, 
for $i \neq j$, the covariance of $Z_i$ and $Z_j$ is 

$$\mathrm{Cov}(Z_i, Z_j) = E[Z_i Z_j] - E[Z_i]E[Z_j] = \frac{n-1}{N-1}\,\frac{n}{N} - \left(\frac{n}{N}\right)^2 = -\frac{1}{N-1}\left(1 - \frac{n}{N}\right)\left(\frac{n}{N}\right).$$
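The covariance formula can be confirmed by counting samples. The sketch uses an arbitrary small case (N = 6, n = 3) and a hypothetical helper `prob` that returns the share of equally likely samples satisfying a condition.

```python
from fractions import Fraction
from itertools import combinations

N, n = 6, 3
samples = list(combinations(range(N), n))

def prob(event):
    """Probability of an event over equally likely SRS samples."""
    return Fraction(sum(1 for S in samples if event(S)), len(samples))

i, j = 0, 1
e_zi = prob(lambda S: i in S)
e_zizj = prob(lambda S: i in S and j in S)
cov = e_zizj - e_zi * e_zi              # E[Z_i] = E[Z_j] by symmetry
formula = -Fraction(1, N - 1) * (1 - Fraction(n, N)) * Fraction(n, N)
print(e_zizj, cov, formula)
```

The covariance is negative because including one unit uses up a slot that another unit could have taken.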


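The negative covariance of $Z_i$ and $Z_j$ is the source of the fpc. The following 
derivation shows how we can use the random variables $Z_1, \ldots, Z_N$ and the properties of 
covariances given in Appendix A to find $V(\bar{y})$: 

$$
\begin{aligned}
V(\bar{y}) &= \frac{1}{n^2}\,\mathrm{Cov}\left(\sum_{i=1}^{N} Z_i y_i,\ \sum_{j=1}^{N} Z_j y_j\right) \\
&= \frac{1}{n^2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j\,\mathrm{Cov}(Z_i, Z_j) \\
&= \frac{1}{n^2}\left[\sum_{i=1}^{N} y_i^2\,V(Z_i) + \sum_{i=1}^{N}\sum_{j \neq i} y_i y_j\,\mathrm{Cov}(Z_i, Z_j)\right] \\
&= \frac{1}{n^2}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\left[\sum_{i=1}^{N} y_i^2 - \frac{1}{N-1}\sum_{i=1}^{N}\sum_{j \neq i} y_i y_j\right] \\
&= \frac{1}{n}\left(1 - \frac{n}{N}\right)\frac{1}{N(N-1)}\left[(N-1)\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2 + \sum_{i=1}^{N} y_i^2\right] \\
&= \frac{1}{n}\left(1 - \frac{n}{N}\right)\frac{1}{N(N-1)}\left[N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2\right] \\
&= \left(1 - \frac{n}{N}\right)\frac{S^2}{n}.
\end{aligned}
$$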


To show that the estimator in (2.11) is an unbiased estimator of the variance, we 
need to show that $E[s^2] = S^2$. The argument proceeds much like the previous one. 
Since $S^2 = \sum_{i=1}^{N}(y_i - \bar{y}_U)^2/(N - 1)$, it makes sense when trying to find an unbiased 
estimator to find the expected value of $\sum_{i \in S}(y_i - \bar{y})^2$, and then find the multiplicative 
constant that will give the unbiasedness: 

$$
\begin{aligned}
E\left[\sum_{i \in S}(y_i - \bar{y})^2\right] &= E\left[\sum_{i \in S}\{(y_i - \bar{y}_U) - (\bar{y} - \bar{y}_U)\}^2\right] \\
&= E\left[\sum_{i \in S}(y_i - \bar{y}_U)^2 - n(\bar{y} - \bar{y}_U)^2\right] \\
&= E\left[\sum_{i=1}^{N} Z_i (y_i - \bar{y}_U)^2\right] - n\,V(\bar{y}) \\
&= \frac{n}{N}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2 - \left(1 - \frac{n}{N}\right)S^2 \\
&= \frac{n(N-1)}{N}\,S^2 - \frac{N-n}{N}\,S^2 \\
&= (n - 1)S^2.
\end{aligned}
$$

Thus, 

$$E\left[\frac{1}{n-1}\sum_{i \in S}(y_i - \bar{y})^2\right] = E[s^2] = S^2.$$
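Both results can be checked exactly by enumerating all samples from a toy population. The values below are hypothetical; exact fractions are used so that equality, not approximate agreement, is being tested.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 8, 1, 12, 7, 5]                  # hypothetical population values
N, n = len(y), 3
ybar_U = Fraction(sum(y), N)
S2 = sum((v - ybar_U) ** 2 for v in y) / (N - 1)

means, s2_values = [], []
for S in combinations(y, n):             # every possible SRS of size n
    m = Fraction(sum(S), n)
    means.append(m)
    s2_values.append(sum((v - m) ** 2 for v in S) / (n - 1))

M = len(means)
var_ybar = sum((m - ybar_U) ** 2 for m in means) / M    # exact design variance
expected_s2 = sum(s2_values) / M
print(var_ybar == (1 - Fraction(n, N)) * S2 / n, expected_s2 == S2)
```

The check confirms the fpc form of the variance and the unbiasedness of $s^2$ for this population; the algebra above shows it holds for any fixed values.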


A Prediction Approach for Simple Random Sampling* 


Unless you have studied randomization theory in the design of experiments, the proofs 
in the preceding section probably seemed strange to you. The random variables in 
randomization theory are not concerned with the responses $y_i$. The random variables 
$Z_1, \ldots, Z_N$ are indicator variables that tell us whether the ith unit is in the sample or 
not. In a design-based, or randomization-theory, approach to sampling inference, the 
only relationship between units sampled and units not sampled is that the nonsampled 
units could have been sampled had we used a different starting value for the random 
number generator. 

In Section 2.8 we found properties of the sample mean $\bar{y}$ using randomization 
theory: $y_1, y_2, \ldots, y_N$ were considered to be fixed values, and $\bar{y}$ is unbiased because 
$\bar{y} = (1/n)\sum_{i=1}^{N} Z_i y_i$ and $E[Z_i] = P(Z_i = 1) = n/N$. The only probabilities used in 
finding the expected value and variance of $\bar{y}$ are the probabilities that subsets of units are 
included in the sample. The quantity measured on unit i, $y_i$, can be anything: Whether 
$y_i$ is number of television sets owned, systolic blood pressure, or acreage devoted to 
soybeans, the properties of estimators depend exclusively on the joint distribution of 
the random variables $\{Z_1, \ldots, Z_N\}$. 

In your other statistics classes, you most likely learned a different approach to 
inference, an approach explained in Chapter 5 of Casella and Berger (2002). There, 
you had random variables $\{Y_i\}$ that followed some probability distribution, and the 
actual sample values were realizations of those random variables. Thus you assumed, 
for example, that $Y_1, Y_2, \ldots, Y_n$ were independent and identically distributed from a 
normal distribution with mean $\mu$ and variance $\sigma^2$, and used properties of independent 
random variables and the normal distribution to find expected values of various 
statistics. 

We can extend this approach to sampling by thinking of random variables 
$Y_1, Y_2, \ldots, Y_N$ generated from some model. The actual values for the finite population, 
$y_1, y_2, \ldots, y_N$, are one realization of the random variables. The joint probability 
distribution of $Y_1, Y_2, \ldots, Y_N$ supplies the link between units in the sample and units not in 
the sample in this model-based approach, a link that is missing in the randomization 
approach. Here, we sample $\{y_i, i \in S\}$ and use these data to predict the unobserved 
values $\{y_i, i \notin S\}$. Thus, problems in finite population sampling may be thought of as 
prediction problems. 

In an SRS, a simple model to adopt is: 

$$Y_1, Y_2, \ldots, Y_N \text{ independent with } E_M[Y_i] = \mu \text{ and } V_M[Y_i] = \sigma^2. \tag{2.31}$$

The subscript M indicates that the expectation uses the model, not the randomization 
distribution used in Section 2.8. Here, $\mu$ and $\sigma^2$ represent unknown infinite population 
parameters, not the finite population quantities in Section 2.8. This model makes 
assumptions about the observations not in the sample; namely, that they have the 
same mean and variance as observations that are in the sample. We take a sample S 
and observe the values $y_i$ for $i \in S$. That is, we see realizations of the random variables 
$Y_i$ for $i \in S$. The other observations in the population $\{y_i, i \notin S\}$ are also realizations 
of random variables, but we do not see those. The finite population total t for our 
sample can be written as 

$$t = \sum_{i=1}^{N} y_i = \sum_{i \in S} y_i + \sum_{i \notin S} y_i$$

and is one possible value that can be taken on by the random variable 

$$T = \sum_{i=1}^{N} Y_i = \sum_{i \in S} Y_i + \sum_{i \notin S} Y_i.$$


We know the values $\{y_i, i \in S\}$. To estimate t for our sample, we need to predict 
values for the $y_i$’s not in the sample. This is where our model of the common mean 
$\mu$ comes in. The least squares estimator of $\mu$ from the sample is $\bar{Y}_S = \sum_{i \in S} Y_i/n$, 
and this is the best linear unbiased predictor (under the model) of each unobserved 
random variable, so that 

$$\hat{T} = \sum_{i \in S} Y_i + \sum_{i \notin S} \bar{Y}_S = \sum_{i \in S} Y_i + \frac{N-n}{n}\sum_{i \in S} Y_i = \frac{N}{n}\sum_{i \in S} Y_i.$$

The estimator $\hat{T}$ is model-unbiased: if the model is correct, then the average of $\hat{T} - T$ 
over repeated realizations of the population is 

$$E_M[\hat{T} - T] = \frac{N}{n}\sum_{i \in S} E_M[Y_i] - \sum_{i=1}^{N} E_M[Y_i] = 0.$$

(Notice the difference between finding expectations under the model-based approach 
and under the design-based approach. In the model-based approach, the $Y_i$’s are the 
random variables, and the sample has no information for calculating expected values, 
so we can take the sum $\sum_{i \in S}$ outside of the expected value. In the design-based 
approach, the random variables are contained in the sample S.) 

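The mean squared error is also calculated as the average squared deviation between 
the estimate and the finite population total. For any given realization of the random 
variables, the squared error is 

$$(\hat{t} - t)^2 = \left[\frac{N}{n}\sum_{i \in S} y_i - \sum_{i=1}^{N} y_i\right]^2.$$

Averaging this quantity over all possible realizations of the random variables gives 
the mean squared error under the model assumptions: 

$$
\begin{aligned}
E_M[(\hat{T} - T)^2] &= E_M\left[\left(\frac{N}{n}\sum_{i \in S} Y_i - \sum_{i=1}^{N} Y_i\right)^2\right] \\
&= E_M\left[\left\{\left(\frac{N}{n} - 1\right)\sum_{i \in S} Y_i - \sum_{i \notin S} Y_i\right\}^2\right] \\
&= E_M\left[\left\{\left(\frac{N}{n} - 1\right)\left(\sum_{i \in S} Y_i - n\mu\right) - \left(\sum_{i \notin S} Y_i - (N-n)\mu\right)\right\}^2\right] \\
&= \left(\frac{N}{n} - 1\right)^2 n\sigma^2 + (N - n)\sigma^2 \\
&= N^2\left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}.
\end{aligned}
$$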
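A Monte Carlo run can make the model-based statements concrete. This sketch draws repeated populations from model (2.31), with normality added as an extra assumption since the model itself fixes only the mean and variance; the parameter values and the function name are arbitrary illustrative choices.

```python
import random

def simulate_prediction_error(N=50, n=10, mu=5.0, sigma=2.0, reps=20_000, seed=1):
    """Monte Carlo mean and mean square of T_hat - T under model (2.31).

    Normal draws are an added assumption; (2.31) only fixes mean and variance.
    """
    rng = random.Random(seed)
    total_err = total_sq = 0.0
    for _ in range(reps):
        Y = [rng.gauss(mu, sigma) for _ in range(N)]
        # Under the model it does not matter which units are sampled,
        # so the first n are used.
        T_hat = N / n * sum(Y[:n])
        err = T_hat - sum(Y)
        total_err += err
        total_sq += err * err
    return total_err / reps, total_sq / reps

bias, mse = simulate_prediction_error()
theory = 50 ** 2 * (1 - 10 / 50) * 2.0 ** 2 / 10      # N^2 (1 - n/N) sigma^2 / n
print(round(bias, 2), round(mse, 1), theory)
```

The simulated bias should sit near zero and the simulated mean squared error near the theoretical value, up to Monte Carlo noise.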

In practice, if the model in (2.31) were adopted, you would estimate $\sigma^2$ by the 
sample variance $s^2$. Thus the design-based approach and the model-based approach 
(with the model in (2.31)) lead to the same estimator of the population total and the 
same variance estimator. If a different model were adopted, however, the estimators 
might differ. We shall see in Chapter 4 how a design-based approach and a 
model-based approach can lead to different inferences. 

The design-based approach and the model-based approach with model (2.31) also 
lead to the same CI for the mean. These CIs have different interpretations, however. 
The design-based CI for y,; may be interpreted as follows: If we take all possible SRSs 
of size n from the finite population of size NV, and construct a 95% CI for each sample, 
we expect 95% of all the CIs constructed to include the true population value y,,. Thus, 
the design-based CI has a repeated sampling interpretation. Statistical inference is 
based on repeated sampling from the finite population. 

To construct CIs in the model-based approach, we rely on a central limit theorem 
that states that if n/N is small, the standardized prediction error 

$$\frac{\hat{T} - T}{\sqrt{E_M[(\hat{T} - T)^2]}}$$

converges to a standard normal distribution (Valliant et al., 2000, Section 2.5). For the 
model in (2.31), with $E_M[(\hat{T} - T)^2] = N^2 (1 - n/N) \sigma^2/n$, this central limit theorem 
says that for sufficiently large sample sizes, 

$$P\left[\hat{T} - z_{\alpha/2} N \sqrt{\left(1 - \frac{n}{N}\right)}\,\frac{\sigma}{\sqrt{n}} \le T \le \hat{T} + z_{\alpha/2} N \sqrt{\left(1 - \frac{n}{N}\right)}\,\frac{\sigma}{\sqrt{n}}\right] \approx 1 - \alpha.$$

Substituting the sample standard deviation $s$ for $\sigma$, we get the large sample CI 

$$\hat{T} \pm z_{\alpha/2} N \sqrt{\left(1 - \frac{n}{N}\right)}\,\frac{s}{\sqrt{n}},$$

which has the same form as the design-based CI for the population total $t$. This model-based 
CI is also interpreted using repeated sampling ideas, but in a different way than 
in Section 2.8. The design-based confidence level gives the expected proportion of 
CIs that will include the true finite population total $t = \sum_{i=1}^{N} y_i$, from the set of all 
CIs that could be constructed by taking an SRS of size n from the finite population 
of fixed values $\{y_1, y_2, \ldots, y_N\}$. The model-based confidence level gives the expected 
proportion of CIs that will include the realization of the population total, from the set 
of all samples that could be generated from the model in (2.31). 

In the model-based approach, the probability model is proposed for all population 
units, whether in the sample or not. If the model assumptions are valid, model-based 
inference does not require random sampling: it is assumed that all units in the population 
follow the assumed model, so it makes no difference which ones are chosen 
for the sample. Thus, model-based analyses can be used for nonprobability samples. 
The assumptions for model-based analysis are strong—for the model in (2.31), it is 
assumed that all random variables for the response of interest in the population are 
independent and have mean $\mu$ and variance $\sigma^2$. But we only observe the units in the 
sample, and cannot examine the assumption of whether the model holds for units not 
in the sample. If you take a sample of your friends to estimate the average amount of 
time students at your university spend studying, there is no reason to believe that the 
students not in your sample spend the same average amount of time studying as your 
friends do. As Box (1979, p. 202) said, “All models are wrong but some are useful.” 
If your model is deficient, inferences made using a model-based analysis may be 
seriously flawed. 


A Note on Notation. Many books (Cochran, 1977, for example) and journal articles 
use $Y$ to represent the population total ($t$ in this book), and $\bar{Y}$ to represent the finite 
population mean (our $\bar{y}_U$). In this book, we reserve $Y$ and $T$ to represent random 
variables in a model-based approach. Our usage is consistent with other areas of 
statistics, in which capital letters near the end of the alphabet represent random vari- 
ables. However, you should be aware that notation in the survey sampling literature 
is not uniform. 


Chapter 2: Simple Probability Samples 


2.10 When Should a Simple Random Sample Be Used? 


Simple random samples are usually easy to design and easy to analyze. But they are 
not the best design to use in the following situations: 


• Before taking an SRS, you should consider whether a survey sample is the best 
method for studying your research question. If you want to study whether a cer- 
tain brand of bath oil is an effective mosquito repellent, you should perform a 
controlled experiment, not take a survey. You should take a survey if you want to 
estimate how many people use the bath oil as a mosquito repellent, or if you want 
to estimate how many mosquitoes are in an area. 


• You may not have a list of the observation units, or it may be expensive in terms 
of travel time to take an SRS. If interested in the proportion of mosquitoes in 
southwestern Wisconsin that carry an encephalitis virus, you cannot construct a 
sampling frame of the individual mosquitoes. You would need to sample different 
areas, and then examine some or all of the mosquitoes found in those areas, using a 
form of cluster sampling. Cluster sampling will be discussed in Chapters 5 and 6. 


• You may have additional information that can be used to design a more cost-effective 
sampling scheme. In a survey to estimate the total number of mosquitoes 
in an area, an entomologist would know what terrain would be likely to have high 
mosquito density, and what areas would be likely to have low mosquito density, 
before any samples were taken. You would save effort in sampling by dividing 
the area into strata, groups of similar units, and then sampling plots within each 
stratum. (Stratified sampling will be discussed in Chapter 3.) 


You should use an SRS in these situations: 


• Little extra information is available that can be used when designing the survey. If 
your sampling frame is merely a list of university students’ names in alphabetical 
order and you have no additional information such as major or year, simple random 
or systematic sampling is probably the best probability sampling strategy. 


• Persons using the data insist on using SRS formulas, whether they are appropriate 
or not. Some persons will not be swayed from the belief that one should only 
estimate the mean by taking the average of the sample values—in that case, you 
should design a sample in which averaging the sample values is the right thing to 
do. SRSs are often recommended when sample evidence is used in legal actions; 
sometimes, when a more complicated sampling scheme is used, an opposing 
counsel will try to persuade the jury that the sample results are not valid. 


• The primary interest is in multivariate relationships such as regression equations 
that hold for the whole population, and there are no compelling reasons to take a 
stratified or cluster sample. Multivariate analyses can be done in complex samples, 
but they are much easier to perform and interpret in an SRS. 

Chapter Summary 


In probability sampling, every possible subset from the population has a known proba- 
bility of being selected as the sample. These probabilities provide a basis for inference 
to the finite population. 

Simple random sampling without replacement is the simplest of all probability 
sampling methods. In an SRS, each subset of the population of size n has the same 
probability of being chosen as the sample. The probability that unit i of the population 
appears in the sample is 

$$\pi_i = \frac{n}{N}.$$

The sampling weight for each unit in the sample is 

$$w_i = \frac{1}{\pi_i} = \frac{N}{n};$$

each unit in the sample can be thought of as representing N/n units in the population. 
Estimators for an SRS are similar to those in your introductory statistics class, 
using $\bar{y} = \sum_{i \in S} y_i/n$ and $s^2 = \sum_{i \in S} (y_i - \bar{y})^2/(n - 1)$: 


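Population Quantity | Estimator | Standard Error of Estimator 
Population total, $t = \sum_{i=1}^{N} y_i$ | $\hat{t} = \sum_{i \in S} w_i y_i = N\bar{y}$ | $N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}$ 
Population mean, $\bar{y}_U = t/N$ | $\hat{\bar{y}} = \frac{\hat{t}}{N} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \bar{y}$ | $\sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}$ 
Population proportion, $p$ | $\hat{p}$ | $\sqrt{\left(1 - \frac{n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}}$ 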


The only feature found in the estimators for without-replacement random samples 
that does not occur in with-replacement random samples is the finite population 
correction, (1 − n/N), which decreases the standard error when the sample size is 
large relative to the population size. In most surveys done in practice, the fpc is so 
close to one that it can be ignored. 

For “sufficiently large” sample sizes, an approximate 95% CI is given by 


estimate ± $z_{\alpha/2}$ SE(estimate), 


with $z_{\alpha/2} \approx 1.96$ for 95% confidence. The margin of error of an estimate is the 
half-width of the CI, that is, $z_{\alpha/2} \times$ SE(estimate). 
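
The summary formulas can be collected into a short routine. This is a minimal sketch rather than the book's SAS code: the function names, the dictionary keys, and the four sample values are hypothetical and serve only to show how the estimators, standard errors, and CIs fit together.

```python
import math

def srs_estimates(sample, N, z=1.96):
    """Estimated mean and total, with SEs and CIs, from an SRS of a population of size N."""
    n = len(sample)
    fpc = 1 - n / N                       # finite population correction
    weight = N / n                        # sampling weight w_i = 1/pi_i
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    se_mean = math.sqrt(fpc * s2 / n)
    t_hat = weight * sum(sample)          # equals N * ybar
    se_total = N * se_mean
    return {
        "mean": ybar,
        "se_mean": se_mean,
        "ci_mean": (ybar - z * se_mean, ybar + z * se_mean),
        "total": t_hat,
        "se_total": se_total,
        "ci_total": (t_hat - z * se_total, t_hat + z * se_total),
    }

def srs_proportion(successes, n, N, z=1.96):
    """Estimated proportion, SE, and CI from an SRS of size n."""
    p_hat = successes / n
    se = math.sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Hypothetical data: an SRS of n = 4 values from a population of N = 20.
est = srs_estimates([2, 4, 6, 8], N=20)
print(est["mean"], est["total"], est["se_mean"])
```

A sample of size 4 is far too small for the normal approximation to be trusted; the tiny example is there only so the arithmetic can be followed by hand.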

Key Terms 


Cluster sample: A probability sample in which each population unit belongs 
to a group, or cluster, and the clusters are sampled according to the sampling 
design. 


Coefficient of variation (CV): The CV of a statistic $\hat{\theta}$, where $E(\hat{\theta}) \neq 0$, is 
$\mathrm{CV}(\hat{\theta}) = \sqrt{V(\hat{\theta})}/E(\hat{\theta})$. 


Confidence interval (CI): An interval estimate for a population quantity, for which 
the probability that the random interval contains the true value of the population 
quantity is known. 


Design-based inference: Inference for finite population characteristics based on the 
survey design, also called randomization inference. 


Finite population correction (fpc): A correction factor which, when multiplied by 
the with-replacement variance, gives the without-replacement variance. For an SRS 
of size n from a population of size N, the fpc is 1 − n/N. 


Inclusion probability: $\pi_i$ = probability that unit i is included in the sample. 


Margin of error: Half of the width of a 95% CI. 


Model-based inference: Inference for finite population characteristics based on a 
model for the population, also called prediction inference. 


Probability sampling: Method of sampling in which every subset of the population 
has a known probability of being included in the sample. 


Sampling distribution: The probability distribution of a statistic generated by the 
sampling design. 


Sampling weight: Reciprocal of the inclusion probability; $w_i = 1/\pi_i$. 


Self-weighting sample: A sample in which all probabilities of inclusion $\pi_i$ are equal, 
so that all sampling weights $w_i$ are the same. 


Simple random sample with replacement (SRSWR): A probability sample in 
which the first unit is selected from the population with probability 1/N; then the 
unit is replaced and the second unit is selected from the set of N units with probability 
1/N, and so on until n units are selected. 


Simple random sample without replacement (SRS): An SRS of size n is a prob- 
ability sample in which any possible subset of n units from the population has the 
same probability (= n!(N − n)!/N!) of being the sample selected. 


Standard error (SE): The square root of the estimated variance of a statistic. 


Stratified sample: A probability sample in which population units are partitioned 
into strata, and then a probability sample of units is taken from each stratum. 


Systematic sample: A probability sample in which every kth unit in the population 
is selected to be in the sample, starting with a randomly chosen value R. Systematic 
sampling is a special case of cluster sampling. 
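
The three selection schemes defined above (SRS, SRSWR, and systematic sampling) can be sketched in a few lines. The function names and the choice of N = 100 units labeled 1 to N are assumptions for illustration; the systematic sampler assumes the random start R is drawn from 1 to k.

```python
import random

def srs(N, n, rng):
    """SRS without replacement: every subset of n units is equally likely."""
    return sorted(rng.sample(range(1, N + 1), n))

def srswr(N, n, rng):
    """SRS with replacement: each draw selects any unit with probability 1/N."""
    return [rng.randint(1, N) for _ in range(n)]

def systematic(N, k, rng):
    """Systematic sample: random start R in 1..k, then every kth unit."""
    R = rng.randint(1, k)
    return list(range(R, N + 1, k))

rng = random.Random(42)
srs_sample = srs(100, 10, rng)
srswr_sample = srswr(100, 10, rng)
sys_sample = systematic(100, 10, rng)
print(srs_sample)
print(srswr_sample)   # may contain repeated units
print(sys_sample)     # units R, R + 10, R + 20, ...
```

Only k distinct systematic samples are possible here, which is why a systematic sample behaves like a cluster sample of one cluster rather than like an SRS.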

For Further Reading 


Stuart’s (1984) book The Ideas of Sampling is an intuitive, nontechnical introduction 
to the structure of probability sampling. He gives simple examples to illustrate 
the differences among the different probability sampling designs. For mathematical 
treatments of simple random sampling, see Raj (1968) and Cochran (1977). Levy and 
Lemeshow (2008) and S.K. Thompson (2002) are general references on sampling. 
M.E. Thompson (1997) and Fuller (2009) develop the mathematical theory of survey 
sampling and prove central limit theorems. Prediction- (model-) based inference is 
developed in Valliant et al. (2000) and Brewer (2002). 


Exercises 


Data files referenced in the exercises are provided and described on the website. 


A. Introductory Exercises 


Let N = 6 and n=3. For purposes of studying sampling distributions, assume that all 
population values are known. 

$y_4 = 133$   $y_5 = 190$   $y_6 = 175$ 
We are interested in $\bar{y}_U$, the population mean. Two sampling plans are proposed. 


• Plan 1. Eight possible samples may be chosen. 


Sample Number Sample, S P(S) 


1 {1,3,5} 1/8 
2 {1,3,6} 1/8 
3 {1,4,5} 1/8 
4 {1,4,6} 1/8 
5 {2,3,5} 1/8 
6 {2,3,6} 1/8 
7 {2,4,5} 1/8 
8 {2,4,6} 1/8 


• Plan 2. Three possible samples may be chosen. 


Sample Number Sample, S P(S) 


1 {1,4,6} 1/4 
2 {2,3,6} 1/2 
3 {1,3,5} 1/4 


a What is the value of $\bar{y}_U$? 


b Let $\bar{y}$ be the mean of the sample values. For each sampling plan, find 
(i) $E[\bar{y}]$; (ii) $V[\bar{y}]$; (iii) Bias($\bar{y}$); (iv) MSE($\bar{y}$). 


c Which sampling plan do you think is better? Why? 

For the population in Example 2.2, consider the following sampling scheme: 


S P(S) 
{1,3,5,6} 1/8 
{2,3,7,8} 1/4 
{1,4,6,8} 1/8 
{2,4,6,8} 3/8 
{4,5,7,8} 1/8 

a Find the probability of selection $\pi_i$ for each unit i. 
b What is the sampling distribution of $\hat{t} = 8\bar{y}$? 

Each of the 10,000 shelves in a certain library is 300 cm long. To estimate how many 
books in the library need rebinding, a librarian takes a sample of 50 books using the 
following procedure: He first generates a random integer between 1 and 10,000 to 
select a shelf, and then generates a random number between 0 and 300 to select a 
location on that shelf. Thus, the pair of random numbers (2531, 25.4) would tell the 
librarian to include the book that is above the location 25.4 cm from the left end of 
shelf number 2531 in the sample. Does this procedure generate an SRS of the books 
in the library? Explain why, or why not. 

For the population in Example 2.2, find the sampling distribution of $\bar{y}$ for 
a an SRS of size 3 (without replacement) 
b an SRSWR of size 3 (with replacement). 
For each, draw the histogram of the sampling distribution of $\bar{y}$. Which sampling 
distribution has the smaller variance, and why? 


An SRS of size 30 is taken from a population of size 100. The sample values are given 
below, and in the data file srs30.dat. 

85266386107 159153567 10143417 106141278129 

a What is the sampling weight for each unit in the sample? 

b Use the sampling weights to estimate the population total, $t$. 

c Give a 95% CI for $t$. Does the fpc make a difference for this sample? 

A university has 807 faculty members. For each faculty member, the number of refereed 
publications was recorded. This number is not directly available on the database, 
so requires the investigator to examine each record separately. A frequency table for 
number of refereed publications is given below for an SRS of 50 faculty members. 


Refereed Publications | 0 1 2 3 4 5 6 7 8 9 10 


Faculty Members | 28 4 3 4 4 2 1 0 2 1 1 


a Plot the data using a histogram. Describe the shape of the data. 


b Estimate the mean number of publications per faculty member, and give the SE 
for your estimate. 


c Do you think that $\bar{y}$ from (b) will be approximately normally distributed? Why or 
why not? 


d Estimate the proportion of faculty members with no publications and give a 
95% CI. 


A letter in the December 1995 issue of Dell Champion Variety Puzzles stated: “I’ve 
noticed over the last several issues there have been no winners from the South in your 
contests. You always say that winners are picked at random, so does this mean you’re 
getting fewer entries from the South?” In response, the editors took a random sample 
of 1,000 entries from the last few contests, and found that 175 of those came from 
the South. 

a Find a 95% CI for the percentage of entries that come from the South. 

b According to Statistical Abstract of the United States, 30.9% of the U.S. population 
live in states that the editors considered to be in the South. Is there evidence from 
your CI that the percentage of entries from the South differs from the percentage 
of persons living in the South? 


Discuss whether an SRS would be appropriate for the following situations. What other 
designs might be used? 


a For an e-mail survey of students, you have a sampling frame that contains a list 
of e-mail addresses for all students. 


b You want to take a sample of patients of board-certified allergists. 


c You want to estimate the percentage of topics in a medical website that have errors. 


d A county election official wants to assess the accuracy of the machine that counts 
the ballots by taking a sample of the paper ballots and comparing the estimated 
vote tallies for candidates from the sample to the machine counts. 


Show that if $n_0/N < 1$, the value of n in (2.25) satisfies 


$$e = z_{\alpha/2}\sqrt{\left(1 - \frac{n}{N}\right)}\,\frac{S}{\sqrt{n}}.$$
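
A numerical check can make the claimed identity concrete before proving it. The sketch below assumes that (2.25) has the usual form $n = n_0/(1 + n_0/N)$ with $n_0 = (z_{\alpha/2} S/e)^2$; equation (2.25) is not reproduced in this section, so that form, the function names, and the example values are all assumptions.

```python
import math

def srs_sample_size(e, S, N, z=1.96):
    """Sample size for margin of error e, assuming (2.25) is n = n0 / (1 + n0/N)."""
    n0 = (z * S / e) ** 2        # sample size that ignores the fpc
    return n0 / (1 + n0 / N)

def margin_of_error(n, S, N, z=1.96):
    """Margin of error z * sqrt(1 - n/N) * S / sqrt(n) for an SRS of size n."""
    return z * math.sqrt(1 - n / N) * S / math.sqrt(n)

# Hypothetical values: desired margin of error 2, S = 10, N = 500.
n = srs_sample_size(e=2.0, S=10.0, N=500)
e_back = margin_of_error(n, S=10.0, N=500)
print(n, e_back)
```

The value of n is left unrounded so that the identity holds exactly up to floating-point error; in practice the sample size would be rounded up to the next integer.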


Which of the following SRS designs will give the most precision for estimating a 
population mean? Assume that each population has the same value of the population 
variance $S^2$. 


1. An SRS of size 400 from a population of size 4000 
2. An SRS of size 30 from a population of size 300 
3. An SRS of size 3000 from a population of size 300,000,000 


B. Working with Survey Data 


Mayr et al. (1994) took an SRS of 240 children who visited their pediatric outpatient 
clinic. They found the following frequency distribution for the age (in months) of free 
(unassisted) walking among the children: 


Age (months) | 9 10 11 12 13 14 15 16 17 18 19 20 
Number of Children | 13 35 44 69 36 24 7 3 2 5 1 1 




a Construct a histogram of the distribution of age at walking. Is the shape normally 
distributed? Do you think the sampling distribution of the sample average will be 
normally distributed? Why, or why not? 


b Find the mean, SE, and a 95% CI for the average age for onset of free walking. 


c Suppose the researchers wanted to do another study in a different region, and 
wanted a 95% CI for the mean age of onset of walking to have margin of error 
0.5. Using the estimated standard deviation for these data, what sample size would 
they need to take? 


The percentage of patients overdue for a vaccination is often of interest for a medical 
clinic. Some clinics examine every record to determine that percentage; in a large 
practice, though, taking a census of the records can be time-consuming. Cullen (1994) 
took a sample of the 580 children served by an Auckland family practice to estimate 
the proportion of interest. 


a What sample size in an SRS (without replacement) would be necessary to estimate 
the proportion with 95% confidence and margin of error 0.10? 


b Cullen actually took an SRS with replacement of size 120, of whom 27 were not 
overdue for vaccination. Give a 95% CI for the proportion of children not overdue 
for vaccination. 


Einarsen et al. (1998) selected an SRS of 935 assistant nurses from a Norwegian 
county with 2700 assistant nurses. A total of 745 assistant nurses (80%) responded to 
the survey. 


a 20% of the 745 respondents reported that bullying occurred in their department. 
Using these respondents as the sample, give a 95% CI for the total number of 
nurses in the county who would report bullying in their department. 


b What assumptions must you make about the nonrespondents for the analysis in (a)? 

In 2005, the Statistical Society of Canada (SSC) had 864 members listed in the online 
directory. An SRS of 150 of the members was selected; the sex and employment 
category (industry, academic, government) were ascertained for each person in the 
SRS, with results in file ssc.dat. 


a What are the possible causes of selection bias in this sample? 


b Estimate the percentage of members who are female, and give a 95% CI for your 
estimate. 


c Assuming that all members are listed in the online directory, estimate the total 
number of SSC members who are female, along with a 95% CI. 


The data set agsrs.dat also contains information on other variables. For each of the 
following quantities, plot the data, and estimate the population mean for that variable 
along with a 95% CI. 


a Number of acres devoted to farms in 1987 
b Number of farms, 1992 
c Number of farms with 1000 acres or more, 1992 


d Number of farms with 9 acres or fewer, 1992 

The Internet site www.golfcourse.com listed 14,938 golf courses by state. It gave a 
variety of information about each course, including greens fees, course rating, par 
for the course, and facilities. Data from an SRS of 120 of the golf courses is in file 
golfsrs.dat. 


a Display the data in a histogram for the weekday greens fees for nine holes of golf. 
How would you describe the shape of the data? 


b Find the average weekday greens fee to play nine holes of golf, and give the SE 
for your estimate. 


Repeat Exercise 16 for the back tee yardage. 


For the data in golfsrs.dat, estimate the proportion of golf courses that have 18 holes, 
and give a 95% CI for the population proportion. 


The Special Census of Maricopa County, Arizona, gave 1995 populations for the 
following cities: 


City Population 
Buckeye 4,857 
Gilbert 59,338 
Gila Bend 1,724 
Phoenix 1,149,417 
Tempe 153,821 


Suppose that you are interested in estimating the percentage of persons who have 
been immunized against polio in each city and can take an SRS of persons. What 
should your sample size be in each of the 5 cities if you want the estimate from each 
city to have margin of error of 4 percentage points? For which cities does the finite 
population correction make a difference? 


C. Working with Theory 


Define the confidence interval procedure by 
$\mathrm{CI}(S) = [\hat{t}_S - 1.96\,\mathrm{SE}(\hat{t}_S),\ \hat{t}_S + 1.96\,\mathrm{SE}(\hat{t}_S)]$. 


Using the method illustrated in Example 2.9, find the exact confidence level for a CI 
based on an SRS (without replacement) of size 4 from the population in Example 2.2. 
Does your confidence level equal 95%? 


One way of selecting an SRS is to assign a number to every unit in the population, 
then use a random number table to select units from the list. A page from a random 
number table is given in file rnt.dat. Explain why each of the following methods will 
or will not result in a simple random sample. 

a The population has 742 units, and we want to take an SRS of size 30. Divide the 
random digits into segments of size 3 and throw out any sequences of three digits 
not between 001 and 742. If a number occurs that has already been included in 
the sample, ignore it. If we used this method with the first line of random numbers 
in rnt.dat, the sequence of three-digit numbers would be 


749 700 699 611 136 
We would include units 700, 699, 611, and 136 in the sample. 


b For the situation in (a), when a random three-digit number is larger than 742, 
eliminate only the first digit and start the sequence with the next digit. With this 
procedure, the first five numbers would be 497, 006, 611, 136, and 264. 


c Now suppose the population has 170 items. If we used the procedures described 
in (a) or (b), we would throw away many of the numbers from the list. To avoid 
this waste, divide every random three-digit number by 170 and use the rounded 
remainder as the unit in the sample. If the remainder is 0, use unit 170. For the 
sequence in the first row of the random number table, the numbers generated 
would be 


69 20 19 101 136 


d Suppose the population has 200 items. Take two-digit sequences of random numbers 
and put a decimal point in front of each to obtain the sequence 


0.74 0.97 0.00 0.69 0.96 


Then multiply each decimal by 200 to get the units for the sample (convert 0.00 
to 200): 


148 194 200 138 192 


e A school has 20 homeroom classes; each homeroom class contains between 20 and 
40 students. To select a student for the sample, draw a random number between 
1 and 20; then select a student at random from the chosen class. Do not include 
duplicates in your sample. 


f For the situation in the preceding question, select a random number between 1 
and 20 to choose a class. Then select a second random number between 1 and 
40. If the number corresponds to a student in the class then select that student; if 
the second random number is larger than the class size, then ignore this pair of 
random numbers and start again. As usual, eliminate duplicates from your list. 


Suppose we are interested in estimating the proportion $p$ of a population that has a 
certain disease. As in Section 2.3 let $y_i = 1$ if person i has the disease, and $y_i = 0$ if 
person i does not have the disease. Then $\hat{p} = \bar{y}$. 


a Show, using the definition in (2.13), that 

$$\mathrm{CV}(\hat{p}) = \sqrt{\frac{N - n}{N - 1}\,\frac{1 - p}{np}}.$$

If the population is large and the sampling fraction is small, so that $\frac{N - n}{N - 1} \approx 1$, 
write (2.26) in terms of the CV for a sample of size 1. 

FIGURE 2.6 
Histogram of the means of 1000 samples of size 300, taken with replacement from the data in 
Example 2.5. Axis label: estimated sampling distribution of $\bar{y}$. 


b Suppose that the fpc ≈ 1. Consider populations with p taking the successive values 
0.001, 0.005, 0.01, 0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95, 0.99, 0.995, 0.999. 


For each value of p, find the sample size needed to estimate the population pro- 
portion (a) with fixed margin of error 0.03, using (2.25), and (b) with relative error 
0.03p, using (2.26). What happens to the sample sizes for small values of p? 


(Requires probability.) In the population used in Example 2.5, 19 of the 3078 counties 
in the population are missing the value of acres92. What is the probability that an 
SRS of size 300 would have no missing data for that variable? 


Decision theoretic approach for sample size estimation. (Requires calculus.) In a 
decision theoretic approach, two functions are specified: 


L(n) = Loss or “cost” of a bad estimate 


C(n) = Cost of taking the sample 


Suppose that for some constants $c_0$, $c_1$, and $k$, 

$$L(n) = k V(\bar{y}_S) = k\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$$

$$C(n) = c_0 + c_1 n.$$

What sample size n minimizes the total cost L(n) + C(n)? 


(Requires computing.) If you have a large SRS, you can estimate the sampling distribution 
of $\bar{y}$ by repeatedly taking samples of size n with replacement from the 
list of sample values. A histogram of the means from 1000 samples of size 300 with 
replacement from the data in Example 2.5 is displayed in Figure 2.6; the shape may 
be slightly skewed, but still appears approximately normal. Would a sample of size 
100 from this population be sufficiently large to use the central limit theorem? Take 
500 samples with replacement of size 100 from the variable acres92 in agsrs.dat, and 
draw a histogram of the 500 means. The approach described in this exercise is known 
as the bootstrap (see Efron and Tibshirani, 1993); we discuss the bootstrap further 
in Section 9.3. 
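
The resampling step described in this exercise can be sketched as follows. This is an illustration only: it does not read agsrs.dat, the skewed values are a synthetic stand-in for the acres92 variable, and the function names are invented here.

```python
import random

def bootstrap_means(values, n_resamples, resample_size, seed=0):
    """Means of repeated with-replacement resamples drawn from the sample values."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=resample_size)  # sampling with replacement
        means.append(sum(resample) / resample_size)
    return means

def bin_counts(xs, bins=10):
    """Counts for a crude equal-width histogram of xs."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in xs:
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Synthetic right-skewed stand-in for the 300 sampled values (NOT the real data).
data_rng = random.Random(1)
fake_sample = [data_rng.expovariate(1 / 300000) for _ in range(300)]

means = bootstrap_means(fake_sample, n_resamples=500, resample_size=100)
for count in bin_counts(means):
    print("*" * (count // 5))
```

To work the exercise itself, the synthetic list would be replaced by the acres92 values read from the data file, and the printed bars by a proper histogram.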

(Requires probability.) In an SRS, each possible subset of n units has probability 
$1/\binom{N}{n}$ of being chosen as the sample; in this chapter, we showed that this definition 
implies that each unit has probability n/N of appearing in the sample. The converse is 
not true, however. Show that the inclusion probability $\pi_i$ for each unit in a systematic 
sample is n/N, but that condition (2.7) is not met. 


(Requires probability.) A typical opinion poll surveys about 1000 adults. Suppose that 
the sampling frame contains 100 million adults including yourself, and that an SRS 
of 1000 adults is chosen from the frame. 


a What is the probability that you are selected to be in the sample? 


b Now suppose that 2000 such samples are selected, each sample selected indepen- 
dently of the others. What is the probability that you will not be in any of the 
samples? 

c How many samples must be selected for you to have a 0.5 probability of being in 
at least one sample? 


(Requires probability.) In an SRSWR, a population unit can appear in the sample 
anywhere between 0 and n times. Let 


$Q_i$ = number of times unit i appears in the sample, 

and 

$$\hat{t} = \frac{N}{n}\sum_{i=1}^{N} Q_i y_i.$$

a Argue that the joint distribution of $Q_1, Q_2, \ldots, Q_N$ is multinomial with n trials and 
$p_1 = p_2 = \cdots = p_N = 1/N$. 

b Using (a) and properties of the multinomial distribution, show that $E[\hat{t}] = t$. 

c Using (a) and properties of the multinomial distribution, find $V[\hat{t}]$. 


(Requires probability.) Suppose you would like to take an SRS of size n from a list 
of N units, but do not know the population size N in advance. Consider the following 
procedure: 

a Set $S_0 = \{1, 2, \ldots, n\}$, so that the initial sample for consideration consists of the 
first n units on the list. 

b For $k = 1, 2, \ldots$, generate a random number $u_k$ between 0 and 1. If $u_k > n/(n + k)$, 
then set $S_k$ equal to $S_{k-1}$. If $u_k \le n/(n + k)$, then select one of the units in $S_{k-1}$ at 
random, and replace it by unit $(n + k)$ to form $S_k$. 


Show that $S_{N-n}$ from this procedure is an SRS of size n. HINT: Use induction. 
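
The procedure in steps a and b can be sketched in code, which also allows an empirical check before attempting the proof. The function name and the small test case (N = 5, n = 2) are assumptions for illustration; the simulation suggests, but does not prove, that all subsets are equally likely.

```python
import random
from collections import Counter

def sequential_srs(units, n, rng):
    """Select n units from a list of unknown length in a single pass.

    The first n units form S_0. Unit n + k then replaces a randomly chosen
    member of the current sample with probability n / (n + k).
    """
    sample = []
    for position, unit in enumerate(units, start=1):
        if position <= n:
            sample.append(unit)                  # build S_0
        else:
            k = position - n
            if rng.random() <= n / (n + k):      # u_k <= n/(n + k)
                sample[rng.randrange(n)] = unit  # replace a random member
    return sample

# Empirical check with N = 5 and n = 2: there are 10 possible subsets,
# so each should appear in roughly 10% of the replications.
rng = random.Random(7)
reps = 20000
counts = Counter()
for _ in range(reps):
    subset = tuple(sorted(sequential_srs(range(1, 6), 2, rng)))
    counts[subset] += 1
freqs = {subset: c / reps for subset, c in counts.items()}
for subset in sorted(freqs):
    print(subset, round(freqs[subset], 3))
```

Because the list is read only once and its length is never used, the same function works when the units arrive as a stream.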


D. Projects and Activities 


Rectangles. This activity was suggested by Gnanadesikan et al. (1997). Figure 2.7 
contains a population of 100 rectangles. Your goal is to estimate the total area of all the 
rectangles by taking a sample of 10 rectangles. Keep your results from this exercise; 
you will use them again in later chapters. 

FIGURE 2.7 
Population of 100 rectangles 

a Select a purposive sample of 10 rectangles that you think will be representative of 
the population of 100 rectangles. Record the area (number of small squares) for 
each rectangle in your sample. Use your sample to estimate the total area. How 
did you choose the rectangles for your sample? 

b Find the sample variance for your purposive sample of 10 rectangles from part 
(a), and use (2.22) to form an interval estimate for the total area $t$. 

c Now take an SRS of 10 rectangles. Use your SRS to estimate the total area of all 
100 rectangles, and find a 95% CI for the total area. 


d Compare your intervals with those of other students in the class. What percentage 
of the intervals from part (b) include the true total area of 3079? What about the 
CIs from part (c)? 

Mutual funds. The websites of companies such as Fidelity (www.fidelity.com), Vanguard 
(www.vanguard.com), and T. Rowe Price (www.troweprice.com) list the mutual 
funds of those companies, along with some statistics about the performance of those 
funds. Take an SRS of 25 mutual funds from one of these companies. Describe how 
you selected the SRS. Find the mean and a 95% CI for the mean of a variable you 
are interested in, such as daily percentage change, or 1-year performance, or length 
of time the fund has existed. 


Baseball data. This activity is due to Jenifer Boshes, who also compiled the data from 
Forman (2004) and publicly available salary information. The data file baseball.dat 
contains statistics on 797 baseball players from the rosters of all major league teams 
in November, 2004. In this exercise (which will be continued in later chapters), you 
will treat the file baseball.dat as a population and draw samples from it using different 
sampling designs. 
a Take an SRS of 150 players from the file. Describe how you selected the SRS. Save 
your data set for use in future exercises (if you are selecting it using SAS PROC 
SURVEYSELECT, you can recreate the data set by using seed = number). 


b Calculate logsal = In(salary). Construct a histogram of the variables salary and 
logsal from your SRS. Does the distribution of salary appear approximately 
normal? What about logsal? 


c Find the mean of the variable logsal, and give a 95% CI. 


d Estimate the proportion of players in the data set who are pitchers, and give a 
95% CI. 


e Since you have the full data file for the population, you can find the true mean 
and proportion for the population. Do your CIs in (c) and (d) contain the true 
population values? 


Online bookstore. The website amazon.com can be used to obtain populations of 
books, CDs, and other wares. 


a In the books search window, type in a genre you like, such as mystery or sports; 
you may want to narrow your search by selecting a subcategory since an upper 
bound is placed on the number of books that can be displayed. Choose a genre 
with at least 20 pages of listings. The list of books forms your population. 


b What is your target population? What is the population size, N? 


c Take an SRS of 50 books from your population. Describe how you selected the 
SRS, and record the amount of time you spent taking the sample and collecting 
the data. 


d Record the following information for each book in your SRS: price, number of 
pages, and whether the book is paperback or hardback. 


e Give a point estimate and a 95% CI for the mean price of books in the genre you 
selected. 


f Give a point estimate and a 95% CI for the mean number of pages for books in 
the genre you selected. 


g Explain, to a person who knows no statistics, what your estimates and CIs mean. 


Take a small SRS of something you’re interested in. Explain what it is you decide 
to study and carefully describe how you chose your SRS (give the random numbers 
generated and explain how you translated them into observations), report your data, 
and give a point estimate and the SE for the quantity or quantities of interest. 

The data collection for this exercise should not take a great deal of effort, as you 
are surrounded by things waiting to be sampled. Some examples: mutual fund data in 
the financial section of today’s newspaper, actual weights of 1-pound bags of carrots 
at the supermarket, or the cost of a used dining room table from an online classified 
advertisement site. 


Estimating the size of an audience. A common method for estimating the size of an 
audience is to take an SRS of n of the N rows in an auditorium, count the number of 
people in each of the selected rows, then multiply the total number of people in your 
sample by N/n. 

a Why is it important to take an SRS instead of a convenience sample of the first 
10 rows? 


b Go to a performance or a lecture, and count the number of rows in the auditorium. 
Take an SRS of 10 or 20 rows, count the number of people in each row, and 
estimate the number of people in the audience using this method. Give a 95% CI. 


Forest data. The data in file forest.dat are from kdd.ics.uci.edu/databases/covertype/ 
covertype.data.html (Blackard, 1998). They consist of a subset of the measurements 
from 581,012 30 x 30m cells from Region 2 of the U.S. Forest Service Resource 
Information System. The original data were used in a data mining application, pre- 
dicting forest cover type from covariates. Data-mining methods are often used to 
explore relationships in very large data sets; in many cases, the data sets are so large 
that statistical software packages cannot analyze them. Many data-mining problems, 
however, can be alternatively approached by analyzing probability samples from the 
population. In these exercises, we treat forest.dat as a population. 


a Select an SRS of size 2000 from the 581,012 records. Keep this sample, or the 
random number seed you used to generate the sample, for later use in Chapter 4. 


b Using your SRS, estimate the percentage of cells in each of the 7 forest cover 
types, along with 95% CIs. 


c Estimate the average elevation in the population, with 95% CI. 


IPUMS data. This exercise is designed for the Integrated Public Use Microdata Series 
(IPUMS), available online at www.ipums.org/usa/ (Ruggles et al., 2004). The IPUMS 
site hosts a collection of samples from the U.S. Decennial Census and American Com- 
munity Survey. In the following exercises, we use a self-weighting sample selected 
from the 1980 Decennial Census sample, selected using the “Small Sample Density” 
option in the data extract tool. The data are in file ipums.dat. We treat these data as a 
population. 


a The variable inctot is total personal income from all sources. Note from the doc- 
umentation for the variable that it is “topcoded” at $75,000 to protect the confi- 
dentiality of the respondents. What effect does the topcoding have on estimates 
from the file? 
b Draw a pilot sample (SRS) of size 50 from the IPUMS population. Use the sample 
variance you get for inctot to determine the sample size you need to estimate the 
average of inctot with a margin of error of 700 or less. 


c Take an SRS of your desired sample size from the population. Estimate the total 
income for the population, and give a 95% CI. Make sure you save the seed number 
you use in SAS PROC SURVEYSELECT or other software so you can recreate 
this sample in later chapters. 


E. SURVEY Exercises 


The program SURVEY (Chang et al., 1992) allows you to draw samples from a 
hypothetical population to learn about cable TV practices. The website contains the 
programs and a description of the population, and gives exercises and activities for 
using the population. The SURVEY exercises continue in subsequent chapters. 


4.1 


Stratified Sampling 


One of the things she [Mama] taught me should be obvious to everyone, but I still find a lot of cooks 
who haven't figured it out yet. Put the food on first that takes the longest to cook. 


—Pearl Bailey, Pearl's Kitchen 


What Is Stratified Sampling? 


EXAMPLE 3.1 


The Federal Deposit Insurance Corporation (FDIC) was created in 1933 by the U.S. 
Congress to supervise banks; it insures deposits at member banks up to a specified 
limit. When a bank fails, the FDIC acquires the assets from that bank and uses them 
to help pay the insured depositors. Valuing the assets is time-consuming, so the FDIC 
selects a sample of the assets in order to estimate the total amount recovered from 
financial institutions (Chapman, 2005). The assets from failed institutions fall into 
several types: (1) consumer loans, (2) commercial loans, (3) securities, (4) real estate 
mortgages, (5) other owned real estate, (6) other assets, and (7) net investments in 
subsidiaries. A simple random sample (SRS) of assets may result in an imprecise 
estimate of the total amount recovered. Consumer loans tend to be much smaller on 
average than assets in the other classes, so the sample variance from an SRS can be 
very large. In addition, an SRS might contain no assets from one or more of the asset 
types; if category (2) assets tend to have the most monetary value and the sample 
chosen has no assets from category (2), that sample may result in an estimate of total 
assets that is too small. It would be desirable to have a method for sampling that 
prevents samples that we know would produce bad estimates, and that increases the 
precision of the estimators. Stratified sampling can accomplish these goals. 


Often, we have supplementary information that can help us design our sample. For 
example, we would know before undertaking an income survey that men generally 
earn more than women, that New York City residents pay more for housing than 
residents of Des Moines, or that rural residents shop for groceries less frequently than 
urban residents. The FDIC has information on the type of each asset, which is related 
to the value of the asset. 

If the variable we are interested in takes on different mean values in different 
subpopulations, we may be able to obtain more precise estimates of population quan- 
tities by taking a stratified random sample. The word stratify comes from Latin words 
meaning “to make layers”; we divide the population into H subpopulations, called 
strata. The strata do not overlap, and they constitute the whole population so that 
each sampling unit belongs to exactly one stratum. We draw an independent probabil- 
ity sample from each stratum, then pool the information to obtain overall population 
estimates. 

We use stratified sampling for one or more of the following reasons: 


1 We want to be protected from the possibility of obtaining a really bad sample. 
When taking an SRS of size 100 from a population of 1000 male and 1000 female 
students, obtaining a sample with no or very few males is theoretically possi- 
ble, although such a sample is not likely to occur. Most people would not con- 
sider such a sample to be representative of the population and would worry that 
men and women might respond differently on the item of interest. In a strati- 
fied sample, you can take an SRS of 50 males and an independent SRS of 50 
females, guaranteeing that the proportion of males in the sample is the same as 
that in the population. With this design, a sample with no or few males cannot be 
selected. 


2 We may want data of known precision for subgroups of the population. These sub- 
groups should be the strata. McIlwee and Robinson (1992) sampled graduates from 
electrical and mechanical engineering programs at public universities in southern 
California. They were interested in comparing the educational and workforce expe- 
riences of male and female graduates, so they stratified their sampling frame by 
gender and took separate random samples of male and female graduates. Because 
there were many more male than female graduates, they sampled a higher fraction 
of female graduates than male graduates in order to obtain comparable precisions 
for the two groups. 


3 A stratified sample may be more convenient to administer and may result in a 
lower cost for the survey. For example, sampling frames may be constructed dif- 
ferently in different strata, or different sampling designs or field procedures may 
be used. In a survey of businesses, an Internet survey might be used for large firms 
while a mail or telephone survey is used for small firms. In other surveys, a differ- 
ent procedure may be used for sampling households in urban strata than in rural 
strata. 


4 Stratified sampling often gives more precise (having lower variance) estimates for 
population means and totals. Persons of different ages tend to have different blood 
pressures, so in a blood pressure study it would be helpful to stratify by age groups. 
If studying the concentration of plants in an area, one would stratify by type of 
terrain; marshes would have different plants than woodlands. Stratification works 
for lowering the variance because the variance within each stratum is often lower 
than the variance in the whole population. Prior knowledge can be used to save 
money in the sampling procedure. 
EXAMPLE 3.2 

Refer to Example 2.5, in which we took an SRS to estimate the average number of 
farm acres per county. In Example 2.5, we noted that even though we scrupulously 
generated a random sample, some areas were overrepresented, and others not repre- 
sented at all. Taking a stratified sample can provide some balance in the sample on 
the stratifying variable. 

The SRS in Example 2.5 exhibited a wide range of values for $y_i$, the number of 
acres devoted to farms in county i in 1992. You might conjecture that part of the large 
variability arises because counties in the western United States are larger, and thus 
tend to have larger values of $y_i$ than counties in the eastern United States. 

For this example, we use the four census regions of the United States—Northeast, 
North Central, South, and West—as strata. The SRS in Example 2.5 sampled about 
10% of the population; to be able to compare the results of the stratified sample with 
the SRS, we also sample about 10% of the counties in each stratum. (We discuss other 
stratified sampling designs later in the chapter.) 


                  Number of Counties    Number of Counties 
Stratum               in Stratum            in Sample 
Northeast                 220                   21 
North Central            1054                  103 
South                    1382                  135 
West                      422                   41 
Total                    3078                  300 


We select four separate SRSs, one from each of the four strata. To select the SRS 
from the Northeast stratum, we number the counties in that stratum from 1 to 220, 
and select 21 numbers randomly from {1, ..., 220}. We follow a similar procedure 
for the other three strata, selecting 103 counties at random from the 1054 in the North 
Central region, 135 counties from the 1382 in the South, and 41 counties from the 
422 in the West. The four SRSs are selected independently: Knowing which counties 
are in the sample from the Northeast tells us nothing about which counties are in the 
sample from the South. 

The data sampled from all four strata are in data file agstrat.dat. A boxplot, showing 
the data for each stratum, is in Figure 3.1. Summary statistics for each stratum are 
given below: 


Region            Sample Size       Average             Variance 

Northeast              21           97,629.8       7,647,472,708 
North Central         103          300,504.2      29,618,183,543 
South                 135          211,315.0      53,587,487,856 
West                   41          662,295.5     396,185,950,266 


Since we took an SRS in each stratum, we can use (2.15) and (2.17) to estimate 
the population quantities for each stratum. We use 

$$(220)(97,629.81) = 21,478,558.2$$

to estimate the total number of acres devoted to farms in the Northeast, with estimated 
variance 

$$(220)^2 \left(1 - \frac{21}{220}\right) \frac{7,647,472,708}{21} = 1.594316 \times 10^{13}.$$

FIGURE 3.1 

The boxplot of data from Example 3.2. The thick line for each region is the median of the 
sample data from that region; the other horizontal lines in the boxes are the 25th and 75th 
percentiles. The Northeast region has a relatively small median and small variance; the West 
region, however, has a much higher median and variance. The distribution of farm acreage 
appears to be positively skewed in each of the regions. 

The following table gives estimates of the total number of farm acres and estimated 
variance of the total for each of the four strata: 


                   Estimated Total       Estimated Variance 
Stratum             of Farm Acres             of Total 
Northeast            21,478,558.2      $1.59432 \times 10^{13}$ 
North Central       316,731,379.4      $2.88232 \times 10^{14}$ 
South               292,037,390.8      $6.84076 \times 10^{14}$ 
West                279,488,706.1      $1.55365 \times 10^{15}$ 
Total               909,736,034.4      $2.5419 \times 10^{15}$ 


We can estimate the total number of acres devoted to farming in the United States 
in 1992 by adding the totals for each stratum; as sampling was done independently in 
each stratum, the variance of the U.S. total is the sum of the variances of the stratum 
totals. Thus we estimate the total number of acres devoted to farming as 909,736,034, 
with standard error (SE) $\sqrt{2.5419 \times 10^{15}} = 50,417,248$. We would estimate the 
average number of acres devoted to farming per county as 909,736,034/3078 = 
295,560.7649, with standard error 50,417,248/3078 = 16,379.87. 
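
The stratum-by-stratum arithmetic above can be retraced with a short script. The sketch below is only an illustration in Python (the book itself uses SAS); the function name `stratum_total` and the variable names are made up for this example. It starts from the rounded summary statistics printed in the table, so its results should agree with the figures in the text only up to rounding.

```python
import math

# Stratum summaries from Example 3.2: (N_h, n_h, sample mean, sample variance)
strata = {
    "Northeast":     (220,  21,  97_629.8,    7_647_472_708),
    "North Central": (1054, 103, 300_504.2,  29_618_183_543),
    "South":         (1382, 135, 211_315.0,  53_587_487_856),
    "West":          (422,  41,  662_295.5, 396_185_950_266),
}


def stratum_total(N_h, n_h, ybar_h, s2_h):
    """Estimated total and estimated variance for one stratum (SRS within the stratum)."""
    t_hat = N_h * ybar_h
    v_hat = N_h**2 * (1 - n_h / N_h) * s2_h / n_h
    return t_hat, v_hat


t_str = 0.0  # estimated population total: sum of the stratum totals
v_str = 0.0  # its estimated variance: sum of the stratum variances
for name, (N_h, n_h, ybar_h, s2_h) in strata.items():
    t_h, v_h = stratum_total(N_h, n_h, ybar_h, s2_h)
    t_str += t_h
    v_str += v_h
    print(f"{name:14s} total = {t_h:16,.1f}   variance = {v_h:.5e}")

N = sum(N_h for N_h, _, _, _ in strata.values())
se_total = math.sqrt(v_str)
print(f"total = {t_str:,.0f}, SE = {se_total:,.0f}")
print(f"mean per county = {t_str / N:,.1f}, SE = {se_total / N:,.1f}")
```

Because the variances simply add, the West stratum, with by far the largest estimated variance, accounts for most of the uncertainty in the national total.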

For comparison, the estimate of the population total in Example 2.5, using an 
SRS of size 300, was 916,927,110, with standard error 58,169,381. For this example, 
stratified sampling ensures that each region of the United States is represented in 
the sample, and produces an estimate with smaller standard error than an SRS with 
the same number of observations. The sample variance in Example 2.5 was $s^2 = 1.1872 \times 10^{11}$. 
Only the West had sample variance larger than $s^2$; the sample variance 
in the Northeast was only $7.647 \times 10^{9}$. 

Observations within many strata tend to be more homogeneous than observations 
in the population as a whole, and the reduction in variance in the individual strata 
often leads to a reduced variance for the population estimate. In this example, the 
relative gain from stratification can be estimated by the ratio 


$$\frac{\text{estimated variance from stratified sample, with } n = 300}{\text{estimated variance from SRS, with } n = 300} = \frac{2.5419 \times 10^{15}}{3.3837 \times 10^{15}} = 0.75.$$


If these figures were the population variances, we would expect that we would need 
only (300)(0.75) = 225 observations with a stratified sample to obtain the same pre- 
cision as from an SRS of 300 observations. 
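
As a quick check of this ratio, the following Python sketch recomputes it from the two standard errors quoted in the text. The variable names are illustrative only, and, as the text notes, the "equivalent sample size" interpretation treats the estimated variances as if they were population values.

```python
# Estimated variance of the total from the stratified sample (Example 3.2)
v_str = 2.5419e15

# Estimated variance of the total from the SRS in Example 2.5 (squared standard error)
v_srs = 58_169_381 ** 2

ratio = v_str / v_srs      # relative gain from stratification
n_equiv = 300 * ratio      # approximate stratified sample size matching an SRS of 300

print(f"variance ratio = {ratio:.2f}")
print(f"equivalent stratified sample size = {n_equiv:.0f}")
```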

Of course, no law says that you must sample the same fraction of observations in 
every stratum. In this example, there is far more variability from county to county in 
the western region; if acres devoted to farming were the primary variable of interest, 
you would reduce the variance of the estimated total even further by taking a higher 
sampling fraction in the western region than in the other regions. You will explore an 
alternative sampling design in Exercise 12. 


We divide the population of N sampling units into H “layers” or strata, with $N_h$ 
sampling units in stratum h. For stratified sampling to work, we must know the values 
of $N_1, N_2, \ldots, N_H$, and must have 

$$N_1 + N_2 + \cdots + N_H = N,$$

where N is the total number of units in the entire population. 

In stratified random sampling, the simplest form of stratified sampling, we 
independently take an SRS from each stratum, so that $n_h$ observations are randomly 
selected from the $N_h$ population units in stratum h. Define $\mathcal{S}_h$ to be the set of $n_h$ units 
in the SRS for stratum h. The total sample size is $n = n_1 + n_2 + \cdots + n_H$. 
Notation for Stratification 

The population quantities are: 

$$y_{hj} = \text{value of } j\text{th unit in stratum } h$$

$$t_h = \sum_{j=1}^{N_h} y_{hj} = \text{population total in stratum } h$$

$$t = \sum_{h=1}^{H} t_h = \text{population total}$$

$$\bar{y}_{hU} = \frac{\sum_{j=1}^{N_h} y_{hj}}{N_h} = \text{population mean in stratum } h$$

$$\bar{y}_U = \frac{t}{N} = \frac{\sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}}{N} = \text{overall population mean}$$

$$S_h^2 = \sum_{j=1}^{N_h} \frac{(y_{hj} - \bar{y}_{hU})^2}{N_h - 1} = \text{population variance in stratum } h$$

Corresponding quantities for the sample, using SRS estimators within each stratum, 
are: 

$$\bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj}$$

$$\hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h$$

$$s_h^2 = \sum_{j \in \mathcal{S}_h} \frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1}$$


Suppose we only sampled the hth stratum. In effect, we have a population of 
$N_h$ units and take an SRS of $n_h$ units. Then we would estimate $\bar{y}_{hU}$ by $\bar{y}_h$, and $t_h$ by 
$\hat{t}_h = N_h \bar{y}_h$. The population total is $t = \sum_{h=1}^{H} t_h$, so we estimate t by 

$$\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h. \tag{3.1}$$

To estimate $\bar{y}_U$, then, we use 

$$\bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h. \tag{3.2}$$

This is a weighted average of the sample stratum averages; $\bar{y}_h$ is multiplied by $N_h/N$, 
the proportion of the population units in stratum h. To use stratified sampling, the 
sizes or relative sizes of the strata must be known. 


The properties of these estimators follow directly from the properties of SRS 
estimators: 

Unbiasedness. $\bar{y}_{str}$ and $\hat{t}_{str}$ are unbiased estimators of $\bar{y}_U$ and t. An SRS is taken 
in each stratum, so (2.30) implies that $E[\bar{y}_h] = \bar{y}_{hU}$ and consequently 

$$E\left[\sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h\right] = \sum_{h=1}^{H} \frac{N_h}{N} E[\bar{y}_h] = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_{hU} = \bar{y}_U.$$

Variance of the estimators. Since we are sampling independently from the strata, 
and we know $V(\hat{t}_h)$ from the SRS theory, the properties of expected value in 
Section A.2 and (2.16) imply that 

$$V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{S_h^2}{n_h}. \tag{3.3}$$

Standard errors for stratified samples. We can obtain an unbiased estimator of 
$V(\hat{t}_{str})$ by substituting the sample estimators $s_h^2$ for the population parameters $S_h^2$. 
Note that in order to estimate the variances, we need to sample at least two units 
from each stratum. 

$$\hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \tag{3.4}$$

$$\hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h}. \tag{3.5}$$

As always, the standard error of an estimator is the square root of the estimated 
variance: $SE(\bar{y}_{str}) = \sqrt{\hat{V}(\bar{y}_{str})}$. 

Confidence intervals for stratified samples. If either (1) the sample sizes within 
each stratum are large, or (2) the sampling design has a large number of strata, an 
approximate 100(1 − α)% confidence interval (CI) for the population mean $\bar{y}_U$ is 

$$\bar{y}_{str} \pm z_{\alpha/2} \, SE(\bar{y}_{str}).$$

The central limit theorem used for constructing this CI is stated in Krewski and 
Rao (1981). Some survey software packages use the percentile of a t distribution 
with n − H degrees of freedom (df) rather than the percentile of the normal 
distribution. 


EXAMPLE 3.3 

Siniff and Skoog (1964) used stratified random sampling to estimate the size of the 
Nelchina herd of Alaska caribou in February of 1962. In January and early February, 
several sampling techniques were field-tested. The field tests told the investigators 
that several of the proposed sampling units, such as equal-flying-time sampling units, 
were difficult to implement in practice, and that an equal-area sampling unit of 4 
square miles (mi²) would work well for the survey. The biologists used preliminary 
estimates of caribou densities to divide the area of interest into six strata; each stratum 
was then divided into a grid of 4-mi² sampling units. Stratum A, for example, contained 
$N_1 = 400$ sampling units; $n_1 = 98$ of these were randomly selected to be in the survey. 
The following data were reported: 


Stratum    $N_h$    $n_h$    $\bar{y}_h$      $s_h^2$ 
A           400      98       24.1       5,575 
B            30      10       25.6       4,064 
C            61      37      267.6     347,556 
D            18       6      179.0      22,798 
E            70      39      293.7     123,578 
F           120      21       33.2       9,795 


TABLE 3.1 
Spreadsheet for Calculations in Example 3.3 


      A          B        C        D             E           F                                  G 
1     Stratum    $N_h$    $n_h$    $\bar{y}_h$   $s_h^2$     $\hat{t}_h = N_h \bar{y}_h$        $(1 - n_h/N_h) N_h^2 s_h^2 / n_h$ 
2     A          400      98       24.1          5,575        9,640                              6,872,040.82 
3     B           30      10       25.6          4,064          768                                243,840.00 
4     C           61      37      267.6        347,556       16,323.6                           13,751,945.51 
5     D           18       6      179.0         22,798        3,222                                820,728.00 
6     E           70      39      293.7        123,578       20,559                              6,876,006.67 
7     F          120      21       33.2          9,795        3,984                              5,541,171.43 
8     total              211                                 54,496.6                           34,105,732.43 
9     √total                                                                                         5,840.01 


The spreadsheet shown in Table 3.1 displays the calculations for finding the strati- 
fied sampling estimates. The estimated total number of caribou is 54,497 with standard 
error 5,840. An approximate 95% CI for the total number of caribou is 


$$54,497 \pm 1.96(5840) = [43,051, \; 65,943].$$


Of course, this CI only reflects the uncertainty due to sampling error; if the field 
procedure for counting caribou tends to miss animals, then the entire CI will be too 
low. 
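
The spreadsheet calculation in this example can also be written as a small function. The following Python sketch is one possible way to organize formulas (3.1), (3.2), (3.4) and (3.5); the function name `stratified_estimates` and the layout of its input are assumptions made for this illustration, not part of the book's SAS code. It uses the normal percentile 1.96, as in the example.

```python
import math


def stratified_estimates(strata, z=1.96):
    """Stratified random sampling estimates from stratum summaries.

    strata: list of (N_h, n_h, ybar_h, s2_h) tuples, one per stratum.
    z: normal percentile used for the approximate confidence interval.
    """
    for N_h, n_h, _, _ in strata:
        if n_h < 2 or n_h > N_h:
            raise ValueError("each stratum needs 2 <= n_h <= N_h")

    N = sum(N_h for N_h, _, _, _ in strata)
    t_hat = sum(N_h * ybar_h for N_h, _, ybar_h, _ in strata)      # (3.1)
    v_t = sum((1 - n_h / N_h) * N_h**2 * s2_h / n_h                # (3.4)
              for N_h, n_h, _, s2_h in strata)
    se_t = math.sqrt(v_t)
    return {
        "N": N,
        "total": t_hat,
        "se_total": se_t,
        "mean": t_hat / N,            # (3.2)
        "se_mean": se_t / N,          # square root of (3.5)
        "ci_total": (t_hat - z * se_t, t_hat + z * se_t),
    }


# Caribou data from Example 3.3: (N_h, n_h, ybar_h, s2_h) for strata A through F
caribou = [
    (400, 98, 24.1, 5_575),
    (30, 10, 25.6, 4_064),
    (61, 37, 267.6, 347_556),
    (18, 6, 179.0, 22_798),
    (70, 39, 293.7, 123_578),
    (120, 21, 33.2, 9_795),
]

est = stratified_estimates(caribou)
print(f"estimated total = {est['total']:,.1f}, SE = {est['se_total']:,.2f}")
print(f"approximate 95% CI: [{est['ci_total'][0]:,.0f}, {est['ci_total'][1]:,.0f}]")
```

The interval endpoints printed here may differ by a unit or so from those in the text, which rounds the total and the standard error before forming the interval.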


Stratified Sampling for Proportions 

As we observed in Section 2.3, a proportion is a mean of a variable that takes on 
values 0 and 1. To make inferences about proportions, we simply use the results in 
(3.1)-(3.5), with $\bar{y}_h = \hat{p}_h$ and $s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h (1 - \hat{p}_h)$. Then, 

$$\hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{p}_h \tag{3.6}$$

and 

$$\hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}. \tag{3.7}$$

Estimating the total number of population units having a specified characteristic is 
similar: 

$$\hat{t}_{str} = \sum_{h=1}^{H} N_h \hat{p}_h,$$

so the estimated total number of population units with the characteristic is the sum of 
the estimated totals in each stratum. Similarly, $\hat{V}(\hat{t}_{str}) = N^2 \hat{V}(\hat{p}_{str})$. 


EXAMPLE 3.4 

The American Council of Learned Societies (ACLS) used a stratified random sample 
of selected ACLS societies in seven disciplines to study publication patterns and com- 
puter and library use among scholars who belong to one of the member organizations 
of the ACLS (Morton and Price, 1989). The data are shown in Table 3.2. 

Ignoring the nonresponse for now (we'll return to the nonresponse in Exercise 7 of 
Chapter 8) and supposing there are no duplicate memberships, let’s use the stratified 
sample to estimate the percentage and number of respondents of the major societies 
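in those seven disciplines that are female. Here, let $N_h$ be the membership figures, 
and let $n_h$ be the number of valid surveys. Thus, 

$$\hat{p}_{str} = \sum_{h=1}^{7} \frac{N_h}{N} \hat{p}_h = \frac{9100}{44,000}(0.38) + \cdots + \frac{9000}{44,000}(0.26) = 0.2465$$

and 

$$SE(\hat{p}_{str}) = \sqrt{\sum_{h=1}^{7} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}} = 0.0071.$$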

The estimated total number of female members in the societies is $\hat{t}_{str} = 44,000 \times (0.2465) = 10,847$, 
with $SE(\hat{t}_{str}) = 44,000 \times (0.0071) = 312$. 


TABLE 3.2 
Data from ACLS Survey 

                                     Number     Valid      Female 
Discipline           Membership      Mailed     Returns    Members (%) 
Literature              9,100          915        636          38 
Classics                1,950          633        451          27 
Philosophy              5,500          658        481          18 
History                10,850          855        611          19 
Linguistics             2,100          667        493          36 
Political Science       5,500          833        575          13 
Sociology               9,000          824        588          26 

Totals                 44,000        5,385      3,835 


3.3 Sampling Weights in Stratified Random Sampling 


We introduced the notion of sampling weight, $w_i = 1/\pi_i$, in Section 2.4. For an 
SRS, the sampling weight for each observation is the same since all of the inclusion 
probabilities $\pi_i$ are the same. In stratified sampling, however, we may have different 
inclusion probabilities in different strata so that the weights may be unequal for some 
stratified sampling designs. 

The stratified sampling estimator $\hat{t}_{str}$ can be expressed as a weighted sum of the 
individual sampling units: Using (3.1), 


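$$\hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} y_{hj}.$$

The estimator of the population total in stratified sampling may thus be written as 

$$\hat{t}_{str} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}, \tag{3.8}$$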

where the sampling weight for unit j of stratum h is $w_{hj} = N_h/n_h$. The sampling 
weight can again be thought of as the number of units in the population represented 
by the sample member $y_{hj}$. If the population has 1600 men and 400 women, and the 
stratified sample design specifies sampling 200 men and 200 women, then each man 
in the sample has weight 8 and each woman has weight 2. Each woman in the sample 
represents herself and another woman not selected to be in the sample, and each man 
represents himself and seven other men not in the sample. Note that the probability 
of including unit j of stratum h in the sample is $\pi_{hj} = n_h/N_h$, the sampling fraction 
in stratum h. Thus, as before, the sampling weight is simply the reciprocal of the 
inclusion probability: 

$$w_{hj} = \frac{1}{\pi_{hj}}. \tag{3.9}$$
The sum of the sampling weights in stratified random sampling equals the population 
size N; each sampled unit “represents” a certain number of units in the population, 
so the whole sample “represents” the whole population. In a stratified random sample, 
the population mean is thus estimated by 

$$\bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}}. \tag{3.10}$$
FIGURE 3.2 

A stratified random sample from a population with N = 500. The top row is Stratum 1; rows 
2-4 comprise Stratum 2; the bottom 21 rows are Stratum 3. Units in the sample are shaded. 
Stratum 1 has $N_1 = 20$ and $n_1 = 10$, so the sampling weight for each unit in Stratum 1 is 2. 
For Stratum 2, $N_2 = 60$, $n_2 = 12$, and the sampling weight for each unit in Stratum 2 is 5. For 
Stratum 3, $N_3 = 420$, $n_3 = 20$, and the sampling weight for each unit in Stratum 3 is 21. 


Figure 3.2 illustrates a stratified random sample for a population with 3 strata. 
The sampling weights are smallest in Stratum 1, where half of the stratum population 
units are sampled. 


A stratified sample is self-weighting if the sampling fraction $n_h/N_h$ is the same for 
each stratum. In that case, the sampling weight for each observation is N/n, exactly the 
same as in an SRS. The variance of a stratified random sample, however, depends on 
the stratification, so (3.4) must be used to estimate the variance of $\hat{t}_{str}$. Equation (3.4) 
requires that you calculate the variance separately within each stratum; the weights 
do not tell you the stratum membership of the observations. 
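
The role of the weights can be illustrated with the design in Figure 3.2. The Python sketch below uses the stratum sizes and sample sizes from the figure; the y values are made-up numbers generated only so that there is something to add up, and all variable names are assumptions for this illustration. It checks that the weights sum to N and that the weighted form (3.8) gives the same total as the stratum-by-stratum form (3.1).

```python
import random

random.seed(1)  # arbitrary seed so the illustration is reproducible

# Stratum population sizes and sample sizes from Figure 3.2
N_h = {1: 20, 2: 60, 3: 420}
n_h = {1: 10, 2: 12, 3: 20}

# Hypothetical measurements: one made-up y value per sampled unit, stored as (stratum, y)
sample = [(h, random.gauss(10 * h, 2)) for h in N_h for _ in range(n_h[h])]

# Sampling weight w_hj = N_h / n_h, the reciprocal of the inclusion probability (3.9)
weights = {h: N_h[h] / n_h[h] for h in N_h}

sum_w = sum(weights[h] for h, _ in sample)            # should equal N = 500
t_weighted = sum(weights[h] * y for h, y in sample)   # (3.8)
ybar_str = t_weighted / sum_w                         # (3.10)

# The same total from (3.1): sum over strata of N_h times the stratum sample mean
t_by_stratum = sum(
    N_h[h] * sum(y for g, y in sample if g == h) / n_h[h] for h in N_h
)

print(f"weights: {weights}")
print(f"sum of weights = {sum_w:.0f}")
print(f"total via weights = {t_weighted:.2f}, via stratum means = {t_by_stratum:.2f}")
```

Note that the weighted sum needs only the weight attached to each observation, whereas a variance estimate from (3.4) would also need the stratum label carried in the first element of each pair.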


For the caribou survey in Example 3.3, the weights are 


Stratum    $N_h$    $n_h$    $w_{hj}$ 
A           400      98      4.08 
B            30      10      3.00 
C            61      37      1.65 
D            18       6      3.00 
E            70      39      1.79 
F           120      21      5.71 


In stratum A, each sampling unit of 4 mi² represents 4.08 sampling units in the 
stratum (including itself); in stratum B, a sampling unit in the sample represents itself 
and 2 other sampling units that are not in the sample. To estimate the population total, 
then, a new variable of weights could be constructed. This variable would contain the 
value 4.08 for every observation in stratum A, 3.00 for every observation in stratum 
B, and so on. 


EXAMPLE 3.6 

The sample in Example 3.2 was designed so that each county in the United States 
would have approximately the same probability of appearing in the sample. To esti- 
mate the total number of acres devoted to agriculture in the United States, we create 
the variable strwt in file agstrat.dat with the sampling weights; it contains the value 
220/21 for counties in the Northeast stratum, 1054/103 for the North Central coun- 
ties, 1382/135 for the South counties, and 422/41 for the West counties. We can 
use (3.8) to estimate the population total by forming a new column containing the 
product of variables strwt and acres92, then calculating the sum of the new column. 
In doing so, we calculate $\hat{t}_{str}$ = 909,736,035, the same (up to roundoff error) 
estimate as obtained in Example 3.2. Note that even though this sample is approx- 
imately self-weighting, it is not exactly self-weighting because the stratum sample 
sizes must be integers. When calculating estimates, use the exact weights from each 
stratum. 

The variable strwt can be used to estimate population means or totals for every 
variable measured in the sample, and most computer packages for surveys use the 
weight variable to calculate point estimates. Note, however, that you cannot calculate 
the standard error of $\hat{t}_{str}$ unless you know the stratification; you need to use (3.4) 
to estimate the variance. Partial output from SAS PROC SURVEYMEANS for the 
variable acres92 is given below. 


Data Summary 

Number of Strata                4 
Number of Observations        300 
Sum of Weights               3078 


Statistics 

                                        Std Error 
Variable      N     DF       Mean        of Mean         95% CL for Mean 
acres92     300    296     295561          16380     263325.000    327796.530 

Statistics 

Variable            Sum       Std Dev           95% CL for Sum 
acres92       909736035      50417248     810514350    1008957721 


The SAS code on the website also plots the data and finds estimates for other 
variables. 


Allocating Observations to Strata 


So far we have simply analyzed data from a survey that someone else has designed. 
Designing the survey is the most important part of using a survey in research: If the 
survey is badly designed, then no amount of analysis will yield the needed information. 
Survey design includes methods for controlling nonsampling as well as sampling 
error. We discuss design issues for nonsampling error in Chapters 8 and 15. In this 
chapter, we discuss design features that affect the sampling error. Simple random 
sampling involved one design feature: the sample size (Section 2.6). For stratified 
random sampling, we need to design what the strata should be, then decide how many 
observations to sample in each stratum. It is somewhat easier to look at these in reverse 
order. In this section, we assume that the strata have already been fixed, and we discuss 
methods of allocating observations to the strata. File stratselect.sas gives sample SAS 
code that can be used to select stratified samples using the allocation methods in this 
section. 


3.4.1 Proportional Allocation 


If you are taking a stratified sample in order to ensure that the sample reflects the 
population with respect to the stratification variable and you would like your sample 
to be a miniature version of the population, you should use proportional allocation 
when designing the sample. 

In proportional allocation, so called because the number of sampled units in each 
stratum is proportional to the size of the stratum, the inclusion probability $\pi_{hj} = n_h/N_h$ 
is the same (= n/N) for all strata; in a population of 2400 men and 1600 women, 
proportional allocation with a 10% sample would mean sampling 240 men and 160 
women. Thus the probability that an individual will be selected to be in the sample, 
n/N, is the same as in an SRS, but many of the “bad” samples that could occur in an 
SRS (for example, a sample in which all 400 persons are men) cannot be selected in 
a stratified sample with proportional allocation. 
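
Proportional allocation amounts to setting $n_h = n N_h / N$ and then rounding to integers. A minimal Python sketch follows; the function name and the largest-remainder rounding rule are choices made for this illustration, since the text does not prescribe how to round.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units to strata in proportion to stratum size.

    stratum_sizes: dict mapping stratum label to N_h.
    Rounding uses the largest-remainder rule (an assumption of this sketch)
    so that the integer sample sizes sum exactly to n.
    """
    N = sum(stratum_sizes.values())
    exact = {h: n * N_h / N for h, N_h in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}   # round down first
    leftover = n - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# The example in the text: a 10% sample from 2400 men and 1600 women
print(proportional_allocation({"men": 2400, "women": 1600}, 400))

# The four census regions of Example 3.2 with n = 300
regions = {"Northeast": 220, "North Central": 1054, "South": 1382, "West": 422}
print(proportional_allocation(regions, 300))
```

With this rounding rule the regional allocation comes out as 21, 103, 135 and 41 counties, which matches the sample sizes used in Example 3.2, although the text does not say how those sizes were rounded.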

If proportional allocation is used, each unit in the sample represents the 
same number of units in the population: In our example, each man in the sample 
represents 10 men in the population, and each woman represents 10 women in the 
population. The sampling weight for every unit in the sample thus equals 10, and 
the stratified sampling estimator of the population mean is simply the average 
of all of the observations. Proportional allocation thus results in a self-weighting 
sample. The sample in Example 3.2 was designed to be approximately self-weighting.
In a self-weighting sample, \(\bar{y}_{str}\) is the average of all observations in the
sample.
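
The arithmetic behind proportional allocation and the self-weighting property can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the toy observations in the second part are hypothetical, chosen so that 10% of each stratum is sampled.

```python
from fractions import Fraction

def proportional_allocation(stratum_sizes, n_total):
    """Return {stratum: n_h} with n_h = n_total * N_h / N, rounded.

    Rounding can make the n_h sum to slightly more or less than n_total
    when the exact shares are not whole numbers.
    """
    N_total = sum(stratum_sizes.values())
    return {h: round(n_total * N_h / N_total) for h, N_h in stratum_sizes.items()}

def stratified_mean(samples, stratum_sizes):
    """Stratified estimator of the mean: sum_h (N_h / N) * ybar_h."""
    N_total = sum(stratum_sizes.values())
    return sum(
        Fraction(stratum_sizes[h], N_total) * Fraction(sum(ys), len(ys))
        for h, ys in samples.items()
    )

# Example from the text: 2400 men, 1600 women, 10% sample (n = 400).
N = {"men": 2400, "women": 1600}
n = proportional_allocation(N, 400)
# Sampling weight N_h / n_h: how many population units each sampled unit represents.
weights = {h: N[h] / n[h] for h in N}
print(n, weights)

# Hypothetical toy data: strata of 20 and 30 units, 10% sampled from each.
toy_N = {"a": 20, "b": 30}
toy_sample = {"a": [4, 6], "b": [10, 11, 15]}
pooled = [y for ys in toy_sample.values() for y in ys]
# Under proportional allocation the two quantities printed here coincide.
print(stratified_mean(toy_sample, toy_N), Fraction(sum(pooled), len(pooled)))
```

Exact fractions are used in `stratified_mean` so that the equality between the stratified estimator and the plain average is not blurred by floating-point rounding.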

When the strata are large enough, the variance of \(\bar{y}_{str}\) under proportional allocation
is usually at most as large as the variance of the sample mean from an SRS with 
the same number of observations. This is true no matter how silly the stratification 
scheme may be. To see why this might be so, let’s display the variances between strata 
and within strata, for proportional allocation, in an ANOVA table for the population 
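(Table 3.3).

In a stratified sample of size \(n\) with proportional allocation, since \(n_h/N_h = n/N\),
Equation (3.3) implies that

\[
\begin{aligned}
V_{prop}(\hat{t}_{str}) &= \sum_{h=1}^{H}\left(1-\frac{n_h}{N_h}\right)N_h^2\,\frac{S_h^2}{n_h} \\
&= \left(1-\frac{n}{N}\right)\frac{N}{n}\sum_{h=1}^{H} N_h S_h^2 \\
&= \left(1-\frac{n}{N}\right)\frac{N}{n}\left[\mathrm{SSW}+\sum_{h=1}^{H} S_h^2\right].
\end{aligned}
\]

The sums of squares add up, with SSTO = SSW + SSB, so the variance of the estimated
population total from an SRS of size \(n\) is

\[
\begin{aligned}
V_{SRS}(\hat{t}) &= \left(1-\frac{n}{N}\right)N^2\,\frac{S^2}{n} \\
&= \left(1-\frac{n}{N}\right)\frac{N^2}{n}\,\frac{\mathrm{SSTO}}{N-1} \\
&= \left(1-\frac{n}{N}\right)\frac{N^2}{n(N-1)}\,(\mathrm{SSW}+\mathrm{SSB}) \\
&= V_{prop}(\hat{t}_{str}) + \left(1-\frac{n}{N}\right)\frac{N}{n(N-1)}\left[N\,\mathrm{SSB}-\sum_{h=1}^{H}(N-N_h)S_h^2\right].
\end{aligned}
\]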

The above result shows us that proportional allocation with stratification always 
gives smaller variance than SRS unless 


\[
\mathrm{SSB} < \sum_{h=1}^{H}\left(1-\frac{N_h}{N}\right)S_h^2. \qquad (3.11)
\]


TABLE 3.3
Population ANOVA Table

Source                        df       Sum of Squares
Between strata                H - 1    \(\mathrm{SSB} = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(\bar{y}_{hU}-\bar{y}_U)^2 = \sum_{h=1}^{H} N_h(\bar{y}_{hU}-\bar{y}_U)^2\)
Within strata                 N - H    \(\mathrm{SSW} = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(y_{hj}-\bar{y}_{hU})^2 = \sum_{h=1}^{H}(N_h-1)S_h^2\)
Total, about \(\bar{y}_U\)    N - 1    \(\mathrm{SSTO} = \sum_{h=1}^{H}\sum_{j=1}^{N_h}(y_{hj}-\bar{y}_U)^2 = (N-1)S^2\)

This rarely happens when the \(N_h\) are large; generally, the large population sizes of
the strata will force \(N_h(\bar{y}_{hU} - \bar{y}_U)^2 > S_h^2\). In general, the variance of the estimator
of \(t\) from a stratified sample with proportional allocation will be smaller than the
variance of the estimator of \(t\) from an SRS with the same number of observations.
The more unequal the stratum means \(\bar{y}_{hU}\), the more precision you will gain by using
proportional allocation. The variance of \(\hat{t}_{str}\) depends primarily on SSW; since SSTO is
a fixed value for the finite population, SSW is smaller when SSB is larger. Of course, 
this result only holds for population variances; it is possible for a variance estimate 
from proportional allocation to be larger than that from an SRS merely because the 
sample selected had large within-stratum sample variances. 
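
The comparison between proportional allocation and an SRS can be checked numerically on a small population. The sketch below uses a hypothetical two-stratum population (the values are invented for illustration) and exact rational arithmetic to confirm that SSTO = SSW + SSB and that the SRS variance equals the proportional-allocation variance plus the term involving SSB.

```python
from fractions import Fraction

def mean(xs):
    return Fraction(sum(xs), len(xs))

def var(xs):
    """Variance with divisor len(xs) - 1, as in the definitions of S^2 and S_h^2."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def compare_designs(strata, n):
    """Population ANOVA quantities and variances of the estimated total.

    strata: list of lists holding every y value in each stratum.
    n: total sample size (proportional allocation, n_h = n * N_h / N).
    """
    all_y = [y for ys in strata for y in ys]
    N = len(all_y)
    grand = mean(all_y)
    ssb = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in strata)
    ssw = sum((len(ys) - 1) * var(ys) for ys in strata)
    ssto = (N - 1) * var(all_y)
    fpc = 1 - Fraction(n, N)
    v_prop = fpc * Fraction(N, n) * sum(len(ys) * var(ys) for ys in strata)
    v_srs = fpc * Fraction(N * N, n) * var(all_y)
    # Extra variance of the SRS relative to proportional allocation.
    gap = fpc * Fraction(N, n * (N - 1)) * (
        N * ssb - sum((N - len(ys)) * var(ys) for ys in strata)
    )
    return {"ssb": ssb, "ssw": ssw, "ssto": ssto,
            "v_prop": v_prop, "v_srs": v_srs, "gap": gap}

# Hypothetical population: stratum means differ widely, so SSB is large.
strata = [[1, 2, 3, 4], [10, 12, 14, 16, 18, 20]]
result = compare_designs(strata, n=5)
for name, value in result.items():
    print(name, float(value))
```

Because the stratum means in this toy population are far apart, `gap` is positive and the stratified design has the smaller variance; making the strata nearly identical in mean would shrink or reverse it.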


Index mutual funds attempt to mimic the performance of one of the indices of overall 
stock or bond market performance. The Dow Jones Wilshire 5000 Composite Index℠
includes all U.S. equity securities with readily available price information. The stocks 
are weighted by market capitalization to form the index. Total stock market index 
funds attempt to have the same performance as the Wilshire 5000 Index; however, 
buying all of the stocks in the index and adjusting the holdings every time the index is 
revised would lead to excessive transaction costs. As a result, mutual fund companies 
often use stratified sampling to select stocks from the index to include in their index 
funds. The largest 500 companies in the index make up more than 70% of its value; 
most total stock market index funds include all of the companies in this stratum. 
The remaining stocks in the index mutual fund are then sampled from the remaining 
strata constructed using factors including market capitalization, industry exposures, 
dividend yield, price/earnings (P/E) ratio, price/book (P/B) ratio, and earnings growth. 
Not all index funds use stratified random sampling. The prospectuses for some funds 
state that the fund manager selects representative stocks from the different strata. In 
some cases, the fund manager uses a proprietary computer program to select stocks 
within the strata, with the goal of obtaining a better match to the index or reducing 
transaction costs. ■


3.4.2 Optimal Allocation


If the variances \(S_h^2\) are more or less equal across all the strata, proportional allocation
is probably the best allocation for increasing precision. In cases where the \(S_h^2\) vary
greatly, optimal allocation can result in smaller costs. In practice, when we are 
sampling units of different sizes, the larger units are likely to be more variable than 
the smaller units, and we would sample them with a higher sampling fraction. For 
example, if we were to take a sample of American corporations and our goal was 
to estimate the amount of trade with Europe, the variation among large corporations 
would be greater than the variation among small ones. As a result, we would sample 
a higher percentage of the large corporations. Optimal allocation works well for 
sampling units such as corporations, cities, and hospitals, which vary greatly in size. 
It is also effective when some strata are much more expensive to sample than others. 

Neter (1978) tells of a study done by the Chesapeake and Ohio (C&O) Railroad 
Company to determine how much revenue they should get from interline freight 
shipments, since the total freight from a shipment that traveled along several railroads 
was divided among the different railroads. The C&O took a stratified sample of
waybills, the documents that detailed the goods, route, and charges for the shipments. 
The waybills were stratified by the total freight charges, and all of the waybills with 
charges of over $40 were sampled, whereas only 1% of the waybills with charges less 
than $5 were sampled. The justification was that there was little variability among the 
amounts due the C&O in the stratum of the smallest total freight charges, whereas 
the variability in the stratum with charges of over $40 was much higher. 


How are musicians paid when their compositions are performed? In the United States, 
many composers are affiliated with the American Society of Composers, Authors and 
Publishers (ASCAP). Television networks, local television and radio stations, services 
such as Muzak, symphony orchestras, restaurants, nightclubs, and other operations 
pay ASCAP an annual license fee, based largely on the size of the audience, that 
allows them to play compositions in the ASCAP catalog. ASCAP then distributes 
royalties to composers whose works are played. 

Theoretically, an ASCAP member should get royalties every time one of his or her 
compositions is played. Taking a census of every piece of music played in the United 
States, however, would be impractical; to estimate the amount of royalties due to 
members, ASCAP uses sampling. According to Dobishinski (1991), Krasilovsky and 
Shemel (2003, p. 139), and www.ascap.com, ASCAP relies on television producers’ 
cue sheets, which provide details on the music used in a program, to identify and 
tabulate musical pieces played on network television and major cable channels. About 
60,000 hours of tapes are made from radio broadcasts each year, and experts identify 
the musical compositions aired in these broadcasts. 

Stratified sampling is used to sample radio stations for the survey. Radio stations 
are grouped into strata based on the license fee paid to ASCAP, the type of community 
the station is in, and the geographic region. As stations paying higher license fees 
contribute more money for royalties, they are more likely to be sampled; once in the 
sample, high-fee stations are taped more often than low-fee stations. ASCAP thus 
uses a form of optimal allocation in taping: Strata with the highest radio fees, and 
thus with the highest variability in royalty amounts, have larger sampling fractions 
than strata containing radio stations that pay small fees. ■


The objective in optimal allocation is to gain the most information for the least
cost. A simple cost function is given below: Let \(C\) represent total cost, \(c_0\) represent
overhead costs such as maintaining an office, and \(c_h\) represent the cost of taking an
observation in stratum \(h\), so that

\[
C = c_0 + \sum_{h=1}^{H} c_h n_h. \qquad (3.12)
\]


We want to allocate observations to strata so as to minimize \(V(\bar{y}_{str})\) for a given
total cost \(C\), or, equivalently, to minimize \(C\) for a fixed \(V(\bar{y}_{str})\). Suppose that the costs
\(c_1, c_2, \ldots, c_H\) are known. To minimize the total cost for a fixed variance, we can prove
using calculus that the optimal allocation has \(n_h\) proportional to

\[
\frac{N_h S_h}{\sqrt{c_h}} \qquad (3.13)
\]

for each \(h\) (see Exercise 24). Thus, the optimal sample size in stratum \(h\) is

\[
n_h = \left(\frac{N_h S_h/\sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l/\sqrt{c_l}}\right) n. \qquad (3.14)
\]


We shall then sample heavily within a stratum if 


• The stratum accounts for a large part of the population.

• The variance within the stratum is large; we sample more heavily to compensate
for the heterogeneity.

• Sampling in the stratum is inexpensive.


Sometimes applying the optimal allocation formula in Equation (3.14) results in
one or more of the “optimal” \(n_h\)’s being larger than the population size \(N_h\) in those
strata. In that case, take a sample size of \(N_h\) in those strata, and then apply (3.14)
again with the remaining strata.
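
The allocation rule in (3.14), together with the adjustment for strata whose computed sample size exceeds the stratum size, can be sketched as follows. The function name and the three-stratum inputs are hypothetical, invented only to show the capping step; the returned sizes are real numbers that would still need rounding in practice.

```python
import math

def optimal_allocation(N, S, c, n_total):
    """Allocate n_total units with n_h proportional to N_h * S_h / sqrt(c_h).

    N, S, c: dicts of stratum sizes, standard deviations, and unit costs.
    Any stratum whose share would exceed N_h is sampled completely, and
    the rule is applied again to the remaining strata with the remaining
    sample size.  Assumes at least one active stratum has S_h > 0.
    """
    if n_total > sum(N.values()):
        raise ValueError("sample size exceeds population size")
    alloc = {}
    active = set(N)
    remaining = n_total
    while active:
        score = {h: N[h] * S[h] / math.sqrt(c[h]) for h in active}
        total = sum(score.values())
        trial = {h: remaining * score[h] / total for h in active}
        over = [h for h in active if trial[h] > N[h]]
        if not over:
            alloc.update(trial)
            break
        for h in over:
            alloc[h] = N[h]          # take a census of this stratum
            remaining -= N[h]
            active.remove(h)
    return alloc

# Hypothetical strata: a few large, highly variable units and many small ones.
N = {"large": 50, "medium": 500, "small": 5000}
S = {"large": 400, "medium": 40, "small": 5}
c = {"large": 1, "medium": 1, "small": 1}
allocation = optimal_allocation(N, S, c, n_total=200)
print(allocation)
```

With these inputs the first pass asks for more than 50 units from the "large" stratum, so that stratum is sampled completely and the remaining 150 units are divided between the other two strata.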


Dollar stratification is often used in accounting. The recorded book amounts are 
used to stratify the population. If you are auditing the loan amounts for a financial 
institution, stratum 1 might consist of all loans of more than $1 million, stratum 2
might consist of loans between $500,000 and $999,999, and so on down to the smallest 
stratum of loans less than $10,000. Optimal allocation is often an efficient strategy for 
such a stratification: \(S_h^2\) will be much larger in the strata with the large loan amounts,
so optimal allocation will prescribe a higher sampling fraction for those strata. If the 
goal of the audit is to estimate the dollar discrepancy between the audited amounts 
and the amounts in the institution’s books, an error in the recorded amount of one 
of the $3,000,000 loans is likely to contribute more to the audited difference than an 
error in the recorded amount of one of the $3,000 loans. In a survey such as this, you 
may even want to use sample size \(N_1\) in stratum 1, so that each population unit in
stratum 1 has probability 1 of appearing in the sample. ■


If all variances and costs are equal, proportional allocation is the same as optimal 
allocation. If we know the variances within each stratum and they differ, optimal 
allocation gives a smaller variance for the estimator of \(\bar{y}_U\) than proportional allocation.
But optimal allocation is a more complicated scheme; often the simplicity and
self-weighting property of proportional allocation are worth the extra variance. In 
addition, the optimal allocation will differ for each variable being measured, whereas 
the proportional allocation depends only on the number of population units in each 
stratum. Stokes and Plummer (2002) describe linear programming methods that can 
be used to determine optimal allocations when more than one variable is of interest. 

Neyman allocation is a special case of optimal allocation, used when the costs
in the strata (but not the variances) are approximately equal. Under Neyman allocation,
\(n_h\) is proportional to \(N_h S_h\). If the variances \(S_h^2\) are specified correctly, Neyman
allocation will give an estimator with smaller variance than proportional allocation
(see Exercise 25).

TABLE 3.4
Quantities Used for Designing the Caribou Survey in Example 3.10

     A          B       C         D            E              F
1    Stratum    N_h     s_h       N_h s_h      n_h            Sample size
2                                              D*225/$D$9
3    A          400     3,000     1,200,000    96.26          98
4    B           30     2,000        60,000     4.81          10
5    C           61     9,000       549,000    44.04          37
6    D           18     2,000        36,000     2.89           6
7    E           70    12,000       840,000    67.38          39
8    F          120     1,000       120,000     9.63          21
9    total      699               2,805,000   225            211


EXAMPLE 3.10 The caribou survey in Example 3.3 used a form of optimal allocation to determine the
\(n_h\). Before taking the survey, the investigators obtained approximations of the caribou
densities and distribution, and constructed strata to be relatively homogeneous in
terms of population density. They set the total sample size as \(n = 225\). They then
used the estimated count in each stratum as a rough estimate of the standard deviation,
with the result shown in Table 3.4. The first row contains the names of the spreadsheet
columns, and the second row contains the formulas used to calculate the table. The
investigators wanted the sampling fraction to be at least 1/3 in smaller strata, so they
used the optimal allocation sample sizes in column E as a guideline for determining
the sample sizes they actually used, in column F. ■
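
The spreadsheet calculation behind column E of Table 3.4 can be reproduced with a short script. This is a sketch rather than the investigators' own computation; the stratum A size of 400 is taken as 1,200,000 / 3,000, which agrees with the column total of 699.

```python
# Stratum sizes N_h and rough standard deviations s_h from Table 3.4.
# N for stratum A is inferred as 1,200,000 / 3,000 = 400 (an assumption
# consistent with the table's total of 699).
N = {"A": 400, "B": 30, "C": 61, "D": 18, "E": 70, "F": 120}
s = {"A": 3000, "B": 2000, "C": 9000, "D": 2000, "E": 12000, "F": 1000}
n_total = 225

# Column D: N_h * s_h, and its total (cell D9).
products = {h: N[h] * s[h] for h in N}
total = sum(products.values())

# Column E: Neyman allocation, n_h = n * N_h s_h / sum(N_l s_l).
neyman = {h: round(n_total * products[h] / total, 2) for h in N}

for h in N:
    print(h, products[h], neyman[h])
print("total", total)
```

The unrounded allocations sum to 225 by construction; the sample sizes actually used (column F) depart from them because of the minimum sampling fraction described in the example.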


When the stratum variances \(S_h^2\) are approximately known, Neyman allocation gives
higher precision than proportional allocation. If the information about the stratum 
variances is of poor quality, however, disproportional allocation can result in a higher 
variance than simple random sampling. Proportional allocation, on the other hand, 
almost always has smaller variance than simple random sampling.

Sometimes you are less interested in the precision of the estimate of the population
total or mean for the whole population than in comparing means or totals among
different strata. In that case, you would determine the sample size needed for the
individual strata using the guidelines in Section 2.6.


EXAMPLE 3.11 The U.S. Postal Service often conducts surveys asking postal customers about their
perceptions of the quality of mail service. The population of residential postal service 
customers is stratified by geographic area, and it is desired that the precision be 
+3 percentage points, at a 95% confidence level, within each area. If there were no 
nonresponse, such a requirement would lead to sampling at least 1067 households in 
each stratum, as calculated in Example 2.11. Such an allocation is neither proportional, 
as the number of residential households in the population varies a great deal from
stratum to stratum, nor optimal in the sense of providing the greatest efficiency for 
estimating percentages for the whole population. It does, however, provide the desired 
precision within each stratum. 


3.4.4 Determining Sample Sizes



The different methods of allocating observations to strata give the relative sample
sizes \(n_h/n\). After strata are constructed (see Section 3.5) and observations allocated to
strata, Equation (3.3) can be used to determine the sample size necessary to achieve
a prespecified margin of error. Recall that

\[
V(\bar{y}_{str}) \le \frac{1}{n}\sum_{h=1}^{H}\frac{n}{n_h}\left(\frac{N_h}{N}\right)^2 S_h^2 = \frac{v}{n},
\]

where \(v = \sum_{h=1}^{H}(n/n_h)(N_h/N)^2 S_h^2\). Thus, if the fpcs can be ignored and if the
normal approximation is valid, an approximate 95% CI for the population mean will
be \(\bar{y}_{str} \pm z_{\alpha/2}\sqrt{v/n}\). Set \(n = z_{\alpha/2}^2 v/e^2\) to achieve a desired margin of error \(e\).

The quantity \(v\) depends on the stratum population sizes \(N_h\) and variances \(S_h^2\),
and on the relative sample sizes \(n_h/n\). If we took an SRS of size \(n\) instead of a
stratified random sample, the variance of \(\bar{y}_{SRS}\) would be (again, ignoring the fpc)
\(S^2/n\). Thus, \(S^2\) can be thought of as the variability per observation unit in an SRS,
and \(v\) can be thought of as the “average” variability per observation unit in a stratified
random sample with the specified allocation. We substitute \(v\) for \(S^2\) in the sample
size (without fpc) formula (2.24) to obtain the necessary sample size for stratified 
sampling. If sampling fractions in the strata are high, this sample size can be adjusted 
for the finite population corrections. In Section 7.5, we shall use a similar method to 
find the necessary sample size for any survey design. 
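
As a rough sketch of this calculation, the function below computes \(v\) for a given relative allocation and then the sample size \(z_{\alpha/2}^2 v/e^2\), ignoring the fpcs. The function name, the stratum variances, and the margin of error in the usage lines are hypothetical values chosen for illustration.

```python
import math

def stratified_sample_size(N, S2, rel_alloc, e, z=1.96):
    """Return (v, n) with v = sum_h (n/n_h) (N_h/N)^2 S_h^2 and n = z^2 v / e^2.

    N: stratum population sizes; S2: stratum variances;
    rel_alloc: relative sample sizes n_h / n (should sum to 1);
    e: desired margin of error; z: normal critical value (1.96 for 95%).
    Finite population corrections are ignored, and n is not rounded.
    """
    N_total = sum(N.values())
    v = sum((1 / rel_alloc[h]) * (N[h] / N_total) ** 2 * S2[h] for h in N)
    return v, z ** 2 * v / e ** 2

# Hypothetical two-stratum design with proportional allocation,
# so that n_h / n = N_h / N.
N = {"men": 2400, "women": 1600}
S2 = {"men": 4.0, "women": 9.0}
rel = {h: N[h] / sum(N.values()) for h in N}
v, n_needed = stratified_sample_size(N, S2, rel, e=0.5)
print(v, math.ceil(n_needed))
```

With a single stratum the formula reduces to the SRS sample size, so `stratified_sample_size({"all": 1}, {"all": 0.25}, {"all": 1.0}, e=0.03)` gives the familiar worst-case figure for a proportion estimated to within 3 percentage points.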


3.5
Defining Strata


One might wonder, since stratified sampling almost always gives higher precision 
than simple random sampling, why anyone would ever take a sample that is not 
stratified. The answer is that stratification adds complexity to the survey, and the 
added complexity may not be worth a small gain in precision. In addition, we need to 
have information in the sampling frame that can be used to form the strata. For each 
stratum, we need to know how many and which members of the population belong to 
that stratum. When such information is available, however, stratification is often well 
worth the effort. 

Remember, stratification is most efficient when the stratum means differ widely; 
then the between sum of squares is large, and the variability within strata will be 
smaller. Consequently, when constructing strata we want the strata means to be as 
different as possible. Ideally, we would stratify by the values of y; if our survey is to 
estimate total business expenditures on advertising, we would like to put businesses 
that spent the most on advertising in stratum 1, businesses with the next highest level
of advertising expenditures in stratum 2, and so on, until the last stratum contained 
businesses that spent nothing on advertising. The problem with this scheme is that 
we do not know the advertising expenditures for all the businesses while designing 
the survey—if we did, we would not need to do a survey at all! Instead, we try to 
find some variable closely related to y. For estimating total business expenditures on 
advertising, we might stratify by number of employees or size of the business and by 
the type of product or service. For farm income, we might use the size of the farm as 
a stratifying variable, since we expect that larger farms would have higher incomes. 

Most surveys measure more than one variable, so any stratification variable should 
be related to many characteristics of interest. The U.S. Current Population Survey, 
which measures characteristics relating to employment, stratifies the areas that form 
the primary sampling units by geographic region, population density, racial composi- 
tion, principal industry, and similar variables. In the Canadian Survey of Employment, 
Payrolls, and Hours, business establishments are stratified by industry, province, and 
estimated number of employees. The Nielsen television ratings stratify by geographic 
region, county size, and cable penetration, among other variables. If several stratifi- 
cation variables are available, use the variables associated with the most important 
responses. 

The number of strata you choose depends upon many factors such as the diffi- 
culty in constructing a sampling frame with stratifying information, and the cost of 
stratifying. A general rule to keep in mind is: The more information, the more strata 
you should use. Thus, you should use an SRS when little prior information about the 
target population is available. 

You can often collect preliminary data that can be used to stratify your design. 
If you are taking a survey to estimate the number of fish in a region, you can use 
physical features of the area that are related to fish density, such as depth, salinity, and 
water temperature. Or you can use survey information from previous years, or data 
from a preliminary cruise to aid in constructing strata. In this situation, according to 
Saville (1977, p. 10): “Usually there will be no point in designing a sampling scheme 
with more than 2 or 3 strata, because our knowledge of the distribution of fish will be 
rather imprecise. Strata may be of different size, and each stratum may be composed 
of several distinct areas in different parts of the total survey area.” In a survey with 
more precise prior information, we will want to use more strata—many surveys are 
stratified to the point that only two sampling units are observed in each stratum. 

For many surveys, stratification can increase precision dramatically, and often 
well repays the effort used in constructing the strata. Example 3.12 tells how strata 
were constructed in one large-scale survey, the National Pesticide Survey. 


Between 1988 and 1990, the U.S. Environmental Protection Agency (1990a, b) sam- 
pled drinking water wells to estimate the prevalence of pesticides and nitrate. When 
designing the National Pesticide Survey (NPS), the EPA scientists wanted a sample 
that was representative of drinking water wells in the United States. In particular, they 
wanted to guarantee that wells in the sample would have a wide range of levels of pes- 
ticide use and susceptibility to ground-water pollution. They also wanted to study two 
categories of wells: community water systems (CWSs), defined as “systems of piped 
drinking water with at least 15 connections and/or 25 or more permanent residents of 
the service area that have at least one working well used to obtain drinking water”;
and rural domestic wells, “drinking water wells supplying occupied housing units 
located in rural areas of the United States, except for wells located on government 
reservations.” 

The following selections from the EPA describe how it chose the strata for the 
survey: 


In order to determine how many wells to visit for data collection, EPA first needed 
to identify approximately how many drinking water wells exist in the United States. 
This process was easier for community water systems than for rural domestic wells 
because a list of all public water systems, with their addresses, is contained in the 
Federal Reporting Data System (FRDS), which is maintained by EPA. From FRDS, 
EPA estimated that there were approximately 51,000 CWSs with wells in the United 
States. EPA did not have a comprehensive list of rural domestic wells to serve as the 
foundation for well selection, as it did for CWSs. Using data from the Census Bureau 
for 1980, EPA estimated that there were approximately 13 million rural domestic wells 
in the country, but the specific owners and addresses of these rural domestic wells were 
not known. 

EPA chose a survey design technique called “stratification” to ensure that survey 
data would meet its objectives. This technique was used to improve the precision of the 
estimates by selecting extra wells from areas with substantial agricultural activity and 
high susceptibility to ground-water pollution (vulnerability). EPA developed criteria 
for separating the population of CWS wells and rural domestic wells into four cate- 
gories of pesticide use and three relative ground-water vulnerability measures. This 
design ensures that the range of variability that exists nationally with respect to the 
agricultural use of pesticides and ground-water vulnerability is reflected in the sample 
of wells. 

EPA identified five subgroups of wells for which it was interested in obtaining 
information. These subgroups were community water system wells in counties with 
relatively high average ground-water vulnerability; rural domestic wells in counties 
with relatively high average ground-water vulnerability; rural domestic wells in coun- 
ties with high pesticide use; rural domestic wells in counties with both high pesticide 
use and relatively high average ground-water vulnerability; and rural domestic wells 
in “cropped and vulnerable” parts of counties (high pesticide use and relatively high 
ground-water vulnerability). 

Two of the most difficult design questions were determining how many wells to 
include in the Survey and determining the level of precision that would be sought 
for the NPS national estimates. These two questions were connected, because greater 
precision is usually obtained by collecting more data. Resolving these questions would 
have been simpler if the Survey designers had known in advance what proportion of 
wells in the nation contained pesticides, but answering that question was one of the 
purposes of the Survey. Although many State studies have been conducted for specific 
pesticides, no reliable national estimates of well water contamination existed. EPA 
evaluated alternative precision requirements and costs for collecting data from different 
numbers of wells to determine the Survey size that would meet EPA’s requirements 
and budget.

The Survey designers ultimately selected wells for data collection so that the
Survey provided a 90 percent probability of detecting the presence of pesticides in the 
CWS wells sampled, assuming 0.5 percent of all community water system wells in 
the country contained pesticides. The rural domestic well Survey design was structured 
with different probabilities of detection for the several subgroups of interest, with 
the greatest emphasis placed on the cropped and vulnerable subcounty areas, where 
EPA was interested in obtaining very precise estimates of pesticide occurrence. EPA 
assumed that 1 percent of rural domestic wells in these areas would contain pesticides
and designed the Survey to have about a 97 percent probability of detection in “cropped 
and vulnerable” areas if the assumption proved accurate. EPA concluded that sampling 
approximately 1,300 wells (564 public wells and 734 private wells) would meet the 
Survey’s accuracy specifications and provide a representative national assessment of 
the number of wells containing pesticides. 

Selecting Wells for the Survey. Because the exact number and location of rural 
domestic wells was unknown, EPA chose a survey design composed of several steps 
(stages) for those wells. The design began with a sampling of counties, and then 
characterized pesticide use and ground-water vulnerability for subcounty areas. This 
eventually allowed small enough geographic areas to be delineated to enable the sam- 
pling of individual rural domestic wells. This procedure was not needed for community 
water system wells, because their number and location were known. 

The first step in well selection was common to both CWS wells and rural domestic 
wells. Each of the 3,137 counties or county equivalents in the U.S. was characterized 
according to pesticide use and ground-water vulnerability to ensure that the variability 
in agricultural pesticide use and ground-water vulnerability was reflected in the Sur- 
vey. EPA used data on agricultural pesticide use obtained from a marketing research 
source and information on the proportion of the county area that was in agricultural 
production to rank agricultural pesticide use for each county as high, medium, low, or 
uncommon. Ground-water vulnerability of each county was estimated using a numer- 
ical classification system called Agricultural DRASTIC, which assesses seven factors: 
(depth of water, recharge, aquifer media, soil media, topography, impact of unsaturated 
zone, conductivity of the aquifer). The model was modified for the Survey to evaluate 
the vulnerability of aquifers to pesticide and nitrate contamination, and one of the 
subsidiary purposes of the Survey was to assess the effectiveness of the DRASTIC 
classification. Each area was evaluated and received a score of high, moderate, or low, 
based on information obtained from U.S. Geological Survey maps, U.S. Department 
of Agriculture soil survey maps and other resources from State agencies, associations, 
and universities. (1990a) 


The procedure resulted in 12 strata for counties, as given in Table 3.5. 

Stratification provides several advantages in this survey. It allows for more precise 
estimates of pesticide and nitrate concentrations in the United States as a whole, as 
it is expected that the wells within a stratum are more homogeneous than the entire 
population of wells. Stratification ensures that wells for each level of pesticide use and 
ground-water vulnerability are included in the sample, and allows estimation of pesti- 
cide concentration with prespecified sample size in each stratum. The factorial design, 
with four levels of the factor pesticide use and three levels of the factor groundwater
vulnerability, allows investigation of possible effects of each factor separately, and
the interaction of the factors, on pesticide concentrations. ■


TABLE 3.5
Strata for National Pesticide Survey

                           Groundwater Vulnerability     Number of
Stratum   Pesticide Use    (as Estimated by DRASTIC)     Counties
 1        High             High                          106
 2        High             Moderate                      234
 3        High             Low                           129
 4        Moderate         High                          110
 5        Moderate         Moderate                      204
 6        Moderate         Low                           267
 7        Low              High                          193
 8        Low              Moderate                      375
 9        Low              Low                           404
10        Uncommon         High                          186
11        Uncommon         Moderate                      513
12        Uncommon         Low                           416

Source: Adapted from U.S. EPA (1990a), p. 3.
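
The detection-probability targets quoted from the EPA in Example 3.12 can be given a rough sense of scale with a simple calculation. The sketch below assumes that wells are sampled independently and that each is contaminated with the same probability; that is a simplification made here for illustration and not the EPA's stratified, multistage design, so the results are not expected to match the sample sizes the EPA chose.

```python
import math

def wells_needed(prevalence, detection_prob):
    """Smallest n with 1 - (1 - prevalence)**n >= detection_prob.

    Assumes n independent wells with a common contamination probability;
    this is an illustrative simplification, not the survey's actual design.
    """
    return math.ceil(math.log(1 - detection_prob) / math.log(1 - prevalence))

def detection_probability(prevalence, n):
    """Probability that at least one of n independent wells is contaminated."""
    return 1 - (1 - prevalence) ** n

# Targets mentioned in the EPA description.
n_cws = wells_needed(0.005, 0.90)      # 0.5 percent prevalence, 90 percent detection
n_rural = wells_needed(0.01, 0.97)     # 1 percent prevalence, 97 percent detection
print(n_cws, detection_probability(0.005, n_cws))
print(n_rural, detection_probability(0.01, n_rural))
```

The calculation shows why rare contamination drives sample sizes into the hundreds: when the assumed prevalence is a fraction of a percent, most samples of moderate size would contain no contaminated wells at all.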


3.6 
Model-Based Inference for Stratified 
Sampling* 


The one-way ANOVA model with fixed effects provides an underlying structure for 
stratified sampling. Here, 


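\[
Y_{hj} = \mu_h + \varepsilon_{hj}, \qquad (3.15)
\]

where the \(\varepsilon_{hj}\)’s are independent with mean 0 and variance \(\sigma_h^2\). Then the least squares
estimator of \(\mu_h\) is \(\bar{Y}_h\), the average in stratum \(h\).
Let the random variable

\[
T_h = \sum_{j=1}^{N_h} Y_{hj}
\]

represent the total in stratum \(h\) and the random variable

\[
T = \sum_{h=1}^{H} T_h
\]

represent the overall total. From Section 2.9, the best linear unbiased estimator for
\(T_h\) is

\[
\hat{T}_h = \frac{N_h}{n_h}\sum_{j \in \mathcal{S}_h} Y_{hj}.
\]

Then, from the results shown for simple random sampling in Section 2.9,

\[
E_M[\hat{T}_h - T_h] = 0
\]

and

\[
E_M[(\hat{T}_h - T_h)^2] = N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{\sigma_h^2}{n_h}.
\]

Since observations in different strata are independent under the model in (3.15),
with \(\hat{T} = \sum_{h=1}^{H}\hat{T}_h\),

\[
\begin{aligned}
E_M[(\hat{T} - T)^2] &= E_M\left[\left(\sum_{h=1}^{H}(\hat{T}_h - T_h)\right)^2\right] \\
&= E_M\left[\sum_{h=1}^{H}(\hat{T}_h - T_h)^2 + \sum_{h=1}^{H}\sum_{k \ne h}(\hat{T}_h - T_h)(\hat{T}_k - T_k)\right] \\
&= \sum_{h=1}^{H} E_M[(\hat{T}_h - T_h)^2] \\
&= \sum_{h=1}^{H} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{\sigma_h^2}{n_h}.
\end{aligned}
\]

The theoretical variance \(\sigma_h^2\) can be estimated by \(s_h^2\). Adopting the model in (3.15)
results in the same estimators for \(t\) and its standard error as found under randomization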
theory in (3.5). If a different model is used, however, then different estimators are 
obtained. 
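
Since the model in (3.15) leads to the same point estimate and standard error as randomization theory, a single routine covers both views. The sketch below estimates the total as the sum of \(N_h \bar{y}_h\) and its variance by substituting the sample variance \(s_h^2\) for \(\sigma_h^2\); the function name and the small data set are hypothetical.

```python
import math
from statistics import mean, variance

def stratified_total(samples, N):
    """Return (t_hat, se) for a stratified random sample.

    samples: {stratum: list of observed y values}
    N: {stratum: population size N_h}
    t_hat = sum_h N_h * ybar_h
    se^2  = sum_h N_h^2 * (1 - n_h/N_h) * s_h^2 / n_h
    Each stratum needs at least two observations so that s_h^2 exists.
    """
    t_hat = 0.0
    var_hat = 0.0
    for h, ys in samples.items():
        n_h = len(ys)
        t_hat += N[h] * mean(ys)
        # statistics.variance uses the n_h - 1 divisor, i.e. s_h^2.
        var_hat += N[h] ** 2 * (1 - n_h / N[h]) * variance(ys) / n_h
    return t_hat, math.sqrt(var_hat)

# Hypothetical sample from two strata of sizes 20 and 30.
N = {"1": 20, "2": 30}
samples = {"1": [4, 6], "2": [10, 11, 15]}
t_hat, se = stratified_total(samples, N)
print(t_hat, se)
```

A different model, for example one in which the variance depends on an auxiliary variable, would change the form of the estimator and this routine would no longer apply.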


3.7
Quota Sampling


Many samples that masquerade as stratified random samples are actually quota sam- 
ples. In quota sampling, the population is divided into different subpopulations just 
as in stratified random sampling, but with one important difference: Probability sam- 
pling is not used to choose individuals in the subpopulation for the sample. In extreme 
versions of quota sampling, choice of units in the sample is entirely at the discretion of 
the interviewer, so that a sample of convenience is chosen within each subpopulation. 

In quota sampling, specified numbers (quotas) of particular types of population 
units are required in the final sample. For example, to obtain a quota sample with 
n = 3000, you might specify that the sample contain 1000 white males, 1000 white 
females, 500 men of color, and 500 women of color, but you might give no further 
instructions about how these quotas are to be filled. Thus, quota sampling is not a 
form of probability sampling—we do not know the probabilities with which each 
individual is included in the sample. It is often used when probability sampling is 
impractical, overly costly, or considered unnecessary, or when the persons designing
the sample just do not know any better. 

The big drawback of quota sampling is that we do not know if the units chosen for 
the sample exhibit selection bias. If selection of units is totally up to the interviewer, 
she or he is likely to choose the most accessible members of the population—for 
instance, persons who are easily reached by telephone, households without menacing 
dogs, or areas of the forest close to the road. The most accessible members of a popula- 
tion are likely to differ in a systematic way from less accessible members. Thus, unlike 
in stratified random sampling, we cannot say that the estimator of the population total 
from quota sampling is unbiased over repeated sampling—one of our usual criteria 
of goodness in probability samples. In fact, in quota samples, we cannot measure 
sampling error over repeated samples and we have no way of estimating the bias from 
the sample data. Since selection of units is up to the individual interviewer, we can- 
not expect that repeating the sample will give similar results. Thus, anyone drawing 
inferences from a quota sample must necessarily take a model-based approach. 
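
The contrast between the two designs can be made concrete with a small enumeration. In the sketch below, every value is hypothetical: each unit has an "accessibility" score and a response y, and the quota interviewer is assumed to pick the most accessible units in each class. Averaging the stratified estimator over every possible stratified random sample recovers the population mean exactly, whereas the quota estimate is a single number with no such guarantee.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical population: (accessibility, y) pairs in two classes.
# More accessible units tend to have larger y, which is what creates bias.
population = {
    "class_1": [(0.9, 12), (0.8, 10), (0.3, 5), (0.1, 3)],
    "class_2": [(0.7, 30), (0.6, 26), (0.4, 22), (0.2, 20), (0.1, 12)],
}
quota = {"class_1": 2, "class_2": 2}

N = {h: len(units) for h, units in population.items()}
N_total = sum(N.values())
true_mean = Fraction(
    sum(y for units in population.values() for _, y in units), N_total
)

def weighted_mean(sample):
    """Estimator sum_h (N_h / N) * ybar_h for {class: tuple of y values}."""
    return sum(
        Fraction(N[h], N_total) * Fraction(sum(ys), len(ys))
        for h, ys in sample.items()
    )

# Quota sample: the interviewer fills each quota with the most accessible units.
quota_sample = {
    h: [y for _, y in sorted(units, reverse=True)[: quota[h]]]
    for h, units in population.items()
}
quota_estimate = weighted_mean(quota_sample)

# Stratified random sampling: enumerate every equally likely sample.
per_class = [
    list(combinations([y for _, y in population[h]], quota[h]))
    for h in population
]
estimates = [
    weighted_mean(dict(zip(population, combo))) for combo in product(*per_class)
]
expected_value = sum(estimates) / len(estimates)

print(float(true_mean), float(quota_estimate), float(expected_value))
```

The size and direction of the quota sample's error here come entirely from the invented link between accessibility and y; in a real survey that link is unknown, which is why the bias cannot be estimated from the sample itself.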


The 1945 survey on reading habits taken for the Book Manufacturer’s Institute (Link 
and Hopf, 1946), like many surveys in the 1940s and 1950s, used a quota sample. Some 
of the classifications used to define the quota classes were area, city size, age, sex, 
and socioeconomic status; a local supervising psychologist in each city determined 
the blocks of the city in which interviewers were to interview people from a specified 
socioeconomic group. The interviewers were then allowed to choose the specific 
households to be interviewed in the designated city blocks. 

The quota procedure followed in the survey did not result in a sample that reflected 
demographic characteristics of the 1945 U.S. population. The following table com- 
pares the educational background of the survey respondents with figures from the 
1940 U.S. Census, adjusted to reflect the wartime changes in population. 


Distribution by            4,000 People       U.S. Census, Urban and
Educational Levels         Interviewed (%)    Rural Nonfarm (%)

8th grade or less          28                 48
1-3 years high school      18                 19
4 years high school        25                 21
1-3 years college          15                  7
4 or more years college    13                  5


Source: Link and Hopf (1946). 


The oversampling of better-educated persons casts doubt on many of the statistics 
given in the book. The study concluded that 31% of “active readers” (those who had 
read at least one book in the past month) had bought the last book they read, and that 
25% of all last books read by active readers cost $1 or less. Who knows whether a 
stratified random sample would have given the same results? 


In the 1948 U.S. presidential elections, all of the major polls printed just a few 
days before the election predicted that Dewey would defeat Truman handily. In fact, 
of course, Truman won the election. According to Mosteller et al. (1949), one of the 
problems of those polls was that they all used quota sampling, not a probability-based 
method—the polling debacle in 1948 spurred many survey organizations in the United 
States to turn away from quota sampling, at least for a few years. The polls that erred in 
predicting the winner in the British general election of 1992 all used quota methods in 
selecting persons to interview in their homes or in the street; the primary quota classes 
used were sex, age, socio-economic class, and employment status. Although we may 
never know exactly what went wrong in those polls (see Crewe, 1992, for some other 
explanations), the use of quota samples may have played a part—if interviewing 
persons “in the street,” it is certainly plausible that persons from a quota class that are 
accessible differ from persons that are less accessible. 

While quota sampling is not as good as probability sampling under ideal condi- 
tions, it may give better results than a completely haphazard sample because it at least 
forces the inclusion of members of the different quota groups. Quota samples have 
the advantage of being less expensive than probability samples. The quality of the 
data from quota samples can be improved by allowing the interviewer less discretion 
in the choice of persons or households to be included in the sample. Many survey 
organizations use probability sampling along with quotas; they use probability sam- 
pling to select small blocks of potential respondents, and then take a quota sample 
within each block, using variables such as age, sex, and race. 

Because we do not know the probabilities with which units were sampled, we must 
take a model-based approach, and make strong assumptions about the data structure, 
when analyzing data from a quota sample. The model generally adopted is that of 
Section 3.6—within each subclass the random variables generating the subpopulation 
are assumed independent and identically distributed. Such a model implies that any 
selection of units from the quota class will give a representative sample; if the model 
holds, then quota sampling will likely give good estimates of the population quantity. 
If the model does not hold, then the estimates from quota sampling may be badly 
biased. 


Sanzo et al. (1993) used a combination of stratified random sampling and quota sam- 
pling for estimating the prevalence of Coxiella burnetii infection within the Basque 
country in northern Spain. Coxiella burnetii can cause Q fever, which can lead to 
complications such as heart and nerve damage. Reviews of Q fever patient records 
from Basque hospitals showed that about three-fourths of the victims were male, 
about half were between 16 and 30 years old, and victims were disproportionately 
likely to be from areas with low population density. 

The authors stratified the target population by population density and then ran- 
domly selected health care centers from the three strata. In selecting persons for 
blood testing, however, “a probabilistic approach was rejected as we considered that 
the refusal rate of blood testing would be high” (p. 1185). Instead, they used quota 
sampling to balance the sample by age and gender; physicians asked patients who 
needed laboratory tests whether they would participate in the study, and recruited 
subjects for the study until the desired sample sizes in the six quota groups were 
reached for each stratum. 

Because a quota sample was taken instead of a probability sample, persons ana- 
lyzing the data must make strong assumptions about the representativeness of the 
sample in order to apply the results to the general population of the Basque country. 
First, the assumption must be made that persons attending a health clinic for labora- 
tory tests (the sampled population of the study) are neither more nor less likely to be 
infected than persons who would not be visiting the clinic. Second, one must assume 
that persons who are asked and agree to take part in the study are similar in terms of the 
infection to persons in the same quota class having laboratory tests who do not 
participate in the study. These are strong assumptions: the authors of the article argue that 
the assumptions are justified, but of course they cannot prove that the assumptions 
hold unless follow-up investigations are done. 

If they had taken a probability sample of persons instead of the quota sample, 
they would not have had to make these strong assumptions. A probability sample of 
persons, however, would have been exorbitantly expensive when compared with the 
quota sampling scheme used, and a probability sample would also have taken longer 
to design and implement. With the quota sample, the authors were able to collect 
information about the public health problem; it is unclear whether the results can be 
generalized to the entire population, but the data do provide a great deal of quick 
information on the prevalence of infection that can be used in future investigation of 
who is likely to be infected, and why. 


Deville (1991, p. 177) argues that quota samples may be useful for market research, 
when the organization requesting the survey is aware of the model being used. Persons 
collecting official statistics about crime, unemployment, or other matters that are used 
for setting public policy should use probability samples, however. 

Quota samples, while easier to collect than a probability sample, suffer from 
the same disadvantages as other convenience samples. Some survey organizations 
now use quota sampling to recruit volunteers for online surveys; they accumulate 
respondents until they have specified sample sizes in the desired demographic classes. 
In such online surveys, the respondents in each quota class are self-selected—if, as 
argued by Couper (2000), Internet users who volunteer for such surveys differ from 
members of the target population in those quota classes, results will be biased. 


Chapter Summary 


Stratification uses additional information about a population in the survey design. In 
the simplest form, stratified random sampling, we take an SRS of size $n_h$ in stratum 
$h$, for each of the $H$ strata in the population. To use stratification, we must know 
the population size $N_h$ for each stratum; we must also know the stratum membership 
for every unit in the population. The inclusion probability for unit $i$ in stratum $h$ is 
$\pi_{hi} = n_h/N_h$; consequently, the sampling weight for that unit is $w_{hi} = N_h/n_h$.

To estimate the population total $t$ using a stratified random sample, let $\hat{t}_h$ estimate 
the population total in stratum $h$. Then 

$$\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h \quad \text{and} \quad \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h).$$

The population mean $\bar{y}_U = t/N$ is estimated by 

$$\bar{y}_{str} = \frac{\hat{t}_{str}}{N},$$

with $\hat{V}(\bar{y}_{str}) = \hat{V}(\hat{t}_{str})/N^2$. 
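As a rough illustration of the estimators summarized above, the following sketch computes the stratified estimate of a total and a mean from stratum-level summaries. The stratum sizes, means, and variances are made up, and the function name `stratified_total` is hypothetical. The within-stratum variance uses the usual SRS formula with a finite population correction, $N_h^2 (1 - n_h/N_h) s_h^2 / n_h$.

```python
from math import sqrt


def stratified_total(strata):
    """Estimate a population total and its standard error.

    strata: list of dicts with keys
      N    - number of units in the stratum
      n    - number of sampled units in the stratum
      ybar - sample mean in the stratum
      s2   - sample variance in the stratum
    """
    # Estimated total: sum over strata of N_h * ybar_h
    t_hat = sum(h["N"] * h["ybar"] for h in strata)
    # Estimated variance: sum of the within-stratum SRS variances (with fpc)
    v_hat = sum(h["N"] ** 2 * (1 - h["n"] / h["N"]) * h["s2"] / h["n"]
                for h in strata)
    return t_hat, sqrt(v_hat)


# Hypothetical stratum summaries, for illustration only
strata = [
    {"N": 100, "n": 10, "ybar": 5.0, "s2": 4.0},
    {"N": 200, "n": 20, "ybar": 8.0, "s2": 9.0},
]
t_hat, se_t = stratified_total(strata)

N = sum(h["N"] for h in strata)
ybar_str = t_hat / N   # estimated population mean
se_ybar = se_t / N     # its standard error
print(t_hat, se_t, ybar_str, se_ybar)
```

Dividing both the estimated total and its standard error by $N$ gives the estimated mean and its standard error, which is the relationship stated at the end of the summary.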

Stratified sampling has three major design issues: defining the strata, choosing 
the total sample size, and allocating the observations to the defined strata. With pro- 
portional allocation, the same sampling fraction is used in each stratum. Proportional 
allocation almost always results in smaller variances for estimated means and totals 
than simple random sampling. Disproportional allocation may be preferred if some 
strata should have higher sampling fractions than others, for example, if it is desired 
to have larger sample sizes for strata with minority populations or for strata with large 
companies. Optimal allocation specifies taking larger sampling fractions in strata that 
have larger variances. 
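The two allocation rules can be compared numerically. The sketch below assumes the standard forms: proportional allocation takes $n_h = n N_h / N$, and optimal allocation takes $n_h$ proportional to $N_h S_h / \sqrt{c_h}$, where $c_h$ is the per-unit cost in stratum $h$. The stratum sizes and standard deviations are invented for illustration, and the function names are hypothetical.

```python
def proportional_allocation(N, n):
    """Allocate n so that every stratum has the same sampling fraction."""
    total = sum(N)
    return [n * Nh / total for Nh in N]


def optimal_allocation(N, S, n, cost=None):
    """Allocate n with n_h proportional to N_h * S_h / sqrt(c_h).

    With equal costs this reduces to n_h proportional to N_h * S_h.
    """
    if cost is None:
        cost = [1.0] * len(N)
    w = [Nh * Sh / ch ** 0.5 for Nh, Sh, ch in zip(N, S, cost)]
    w_total = sum(w)
    return [n * wh / w_total for wh in w]


# Hypothetical strata: sizes and guessed standard deviations
N = [500, 300, 200]
S = [2.0, 4.0, 10.0]

prop = proportional_allocation(N, 100)
opt = optimal_allocation(N, S, 100)
print(prop)
print([round(x, 1) for x in opt])
```

The results are generally not integers, so in practice they are rounded, and no stratum can be given more than $N_h$ units. In this example the third stratum is the smallest but the most variable, so optimal allocation gives it a much larger share of the sample than proportional allocation does.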


Key Terms 


Disproportional allocation: Allocation of sampling units to strata so that the 
sampling fractions $n_h/N_h$ are unequal. 

Optimal allocation: Allocation of sampling units to strata so that the variance of the 
estimator is minimized for a given total cost. 

Proportional allocation: Allocation of sampling units to strata so that $n_h/N_h = n/N$ 
for each stratum. Proportional allocation results in a self-weighting sample. 

Quota sampling: A nonprobability sampling method which many persons confuse 
with stratified sampling. In quota sampling, quota classes are formed that serve the 
role of strata, but the survey taker uses a nonprobability sampling method such as 
convenience sampling to reach the desired sample size in each quota class. 

Stratified random sampling: Probability sampling method in which population 
units are partitioned into strata, and then an SRS is taken from each stratum. 

Stratum: One of the subpopulations or classes that make up the entire population. 
Every unit in the population is in exactly one stratum. 


For Further Reading 


The references in Chapter 2 also describe stratified sampling. Chapter 4 of Raj (1968) 
gives a rigorous and concise treatment of stratified sampling theory. Cochran (1977) 
has further results on allocation and construction of strata, and uses ANOVA tables 
to compare precisions of sampling designs (first described in Cochran, 1939). 
Neyman (1934) wrote one of the most important papers in the historical devel- 
opment of survey sampling. He presented a framework for stratified sampling and 
demonstrated its superiority to purposive selection methods. Neyman’s paper pretty 
much finished off the idea that results from purposive samples could be generalized 
to the population. He presented an example of a sample of 29 districts, purposely 
chosen to give the averages of all 214 districts in the 1921 Italian Census on a dozen 
variables. But Neyman showed that “all statistics other than the average values of the 
controls showed a violent contrast between the sample and the whole population.” 


A. Introductory Exercises 


What stratification variable(s) would you use for each of the following situations: 

a A political poll to estimate the percentage of registered voters in Arizona that 
approve of the governor’s performance. 

b An e-mail survey of students at your university, to estimate the total amount of 
money students spend on textbooks in a term. 

c A sample of high schools in New York City to estimate what percentage of high 
schools offer one or more classes in computer programming. 

d A sample of public libraries in California to study the availability of computer 
resources, and the per capita expenditures. 

e A survey of anglers visiting a freshwater lake, to learn about which species of fish 
are preferred. 

f An aerial survey to estimate the number of walrus in the pack ice near Alaska 
between 173° East and 154° West longitude. 

g A sample of prime-time (7-10 pm, Monday through Saturday; 6-10 pm, Sunday) 
TV programs on CBS to estimate the average number of promotional announce- 
ments (ads for other programming on the station) per hour of broadcast. 


Consider the hypothetical population below (this population is also used in 
Example 2.2). Consider the stratification below, with $N_1 = N_2 = 4$. The population is: 


Unit number Stratum y 
1 1 1 
2 1 2 
3 1 4 
8 1 8 
4 2 4 
5 2 7 
6 2 7 
7 2 7 


Consider the stratified sampling design in which $n_1 = n_2 = 2$. 

a Write out all possible SRSs of size 2 from stratum 1, and find the probability of 
each sample. Do the same for stratum 2. 

b Using your work in (a), find the sampling distribution of $\hat{t}_{str}$. 

c Find the mean and variance of the sampling distribution of $\hat{t}_{str}$. How do these 
compare to the mean and variance in Example 2.2? 
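The kind of enumeration asked for in this exercise can be automated. The sketch below does so for a different, made-up population (two strata of three units each, with two units sampled per stratum), so that it shows the technique without working the exercise. Exact fractions are used so that the unbiasedness of the stratified estimator of the total can be checked without rounding error.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical y-values by stratum (not the population in the exercise)
strata = {1: [3, 5, 10], 2: [20, 22, 30]}
n_h = {1: 2, 2: 2}


def stratum_total_estimates(y, n):
    """N_h * (sample mean) for every possible SRS of size n from one stratum."""
    N = len(y)
    return [Fraction(N * sum(s), n) for s in combinations(y, n)]


per_stratum = [stratum_total_estimates(strata[h], n_h[h]) for h in strata]

# Strata are sampled independently, so a stratified sample is one SRS from
# each stratum; all such combinations are equally likely.
t_hats = [sum(parts) for parts in product(*per_stratum)]
prob = Fraction(1, len(t_hats))

expected = sum(t * prob for t in t_hats)
variance = sum((t - expected) ** 2 * prob for t in t_hats)
true_total = sum(sum(y) for y in strata.values())

print(len(t_hats), expected, variance, true_total)
```

For this toy population the mean of the sampling distribution equals the true total, as it must for an unbiased estimator, and the variance agrees with the sum over strata of $N_h^2 (1 - n_h/N_h) S_h^2 / n_h$.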


Consider a population of 6 students. Suppose we know the test scores of the students 
to be 


Student | 1 2 3 4 5 6 


Score | 66 59 70 83 82 71 


a Find the mean $\bar{y}_U$ and variance $S^2$ of the population. 
b How many SRS’s of size 4 are possible? 


c List the possible SRS’s. For each, find the sample mean. Using Equation (2.9), 
find $V(\bar{y})$. 

d Now let stratum 1 consist of students 1-3, and stratum 2 consist of students 4-6. 
How many stratified random samples of size 4 are possible in which 2 students 
are selected from each stratum? 


e List the possible stratified random samples. Which of the samples from (c) cannot 
occur with the stratified design? 


f Find $\bar{y}_{str}$ for each possible stratified random sample. Find $V(\bar{y}_{str})$, and compare it 
to $V(\bar{y})$. 


For Example 3.4, construct a data set with 3835 observations. Include three columns: 
column 1 is the stratum number (from 1 to 7), column 2 contains the response variable 
of gender (0 for males and 1 for females), and column 3 contains the sampling weight 
$N_h/n_h$ for each observation. Using columns 2 and 3 along with (3.10), calculate $\hat{p}_{str}$. 
Is it possible to calculate $SE(\hat{p}_{str})$ by using only columns 2 and 3, with no additional 
information? Explain. 
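A small sketch may help with the mechanics of using a weight column. It assumes, without access to the equation itself, that (3.10) is the usual weighted-mean form of the stratified estimator: the sum of weight times response, divided by the sum of the weights. The five records are invented and are not the data of Example 3.4.

```python
# Hypothetical records: (stratum, response 0/1, sampling weight N_h / n_h)
# Stratum 1: N = 20, n = 2, weight 10.  Stratum 2: N = 75, n = 3, weight 25.
records = [
    (1, 1, 10.0),
    (1, 0, 10.0),
    (2, 1, 25.0),
    (2, 1, 25.0),
    (2, 0, 25.0),
]


def weighted_proportion(records):
    """Sum of weight * response divided by the sum of the weights."""
    numerator = sum(w * y for _, y, w in records)
    denominator = sum(w for _, _, w in records)
    return numerator / denominator


p_hat = weighted_proportion(records)

# Same estimate from the stratum-by-stratum formula: sum of (N_h / N) * p_h
p_by_stratum = (20 / 95) * (1 / 2) + (75 / 95) * (2 / 3)
print(p_hat, p_by_stratum)
```

The two calculations agree because each record's weight is the number of population units it represents. Note that the weighted calculation never uses the stratum column, which is worth keeping in mind when thinking about the standard error question.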


The survey in Example 3.4 collected much other data on the subjects. Another of the 
survey’s questions asked whether the respondent agreed with the following statement: 
“When I look at a new issue of my discipline’s major journal, I rarely find an article 
that interests me.” The results are as follows: 


Discipline Agree (%) 
Literature 37 
Classics 23 
Philosophy 23 
History 29 
Linguistics 19 
Political Science 43 
Sociology 41 


a What is the sampled population in this survey? 


b Find an estimate of the percentage of persons in the sampled population that agree 
with the statement, and give the standard error of your estimate. 


Suppose that a city has 90,000 dwelling units, of which 35,000 are houses, 45,000 
are apartments, and 10,000 are condominiums. 


a You believe that the mean electricity usage is about twice as much for houses as 
for apartments or condominiums, and that the standard deviation is proportional 
to the mean so that $S_1 = 2S_2 = 2S_3$. How would you allocate a stratified sample 
of 900 observations if you wanted to estimate the mean electricity consumption 
for all households in the city? 


b Now suppose that you take a stratified random sample with proportional 
allocation and want to estimate the overall proportion of households in which energy 
conservation is practiced. If 45% of house dwellers, 25% of apartment dwellers, 
and 3% of condominium residents practice energy conservation, what is $p$ for the 
population? What gain would the stratified sample with proportional allocation 
offer over an SRS, that is, what is $V_{prop}(\hat{p}_{str})/V_{SRS}(\hat{p}_{SRS})$? 


In Exercise 6 of Chapter 2, data on numbers of publications were given for an SRS 
of 50 faculty members. Not all departments were represented, however, in the SRS. 
The SRS contained several faculty members from psychology and from chemistry, 
but none from foreign languages. The following data are from a stratified sample of 
faculty, using the areas biological sciences, physical sciences, social sciences, and 
humanities as the strata. 


                       Number of Faculty     Number of Faculty
Stratum                Members in Stratum    Members in Sample
Biological Sciences    102                    7
Physical Sciences      310                   19
Social Sciences        217                   13
Humanities             178                   11
Total                  807                   50


The frequency table for number of publications in the strata is given below. 


Number of Refereed     Number of Faculty Members
Publications           Biological   Physical   Social   Humanities
0                      1            10         9        8
1                      2            2          0        2
2                      0            0          1        0
3                      1            1          0        1
4                      0            2          2        0
5                      2            1          0        0
6                      0            1          1        0
7                      1            0          0        0
8                      0            2          0        0

a Estimate the total number of refereed publications by faculty members in the 
college, and give the standard error. 


b How does your result from (a) compare with the result from the SRS in Exercise 6 
of Chapter 2? 


c Estimate the proportion of faculty with no refereed publications, and give the 
standard error. 


d Did stratification increase precision in this example? Explain why you think it did 
or did not. 


A public opinion researcher has a budget of $20,000 for taking a survey. She knows that 
90% of all households have telephones. Telephone interviews cost $10 per household; 
in-person interviews cost $30 each if all interviews are conducted in person, and $40 
each if only nonphone households are interviewed in person (because there will be 
extra travel costs). Assume that the variances in the phone and nonphone groups are 
similar, and that the fixed costs are $c_0 = \$5000$. How many households should be 
interviewed in each group if 


a all households are interviewed in person 


b households with a phone are contacted by telephone and households without a 
phone are contacted in person. 


B. Working with Survey Data 


The data file agstrat.dat also contains information on other variables. For each of the 
following quantities, plot the data, and estimate the population mean for that variable 
along with its standard error and a 95% CI. Compare your answers with those from 
the SRS in Exercise 15 of Chapter 2. 


a Number of acres devoted to farms, 1987 
b Number of farms, 1992 
c Number of farms with 1000 acres or more, 1992 


d Number of farms with 9 acres or fewer, 1992 


Hard shell clams may be sampled by using a dredge. Clams do not tend to be uniformly 
distributed in a body of water, however, as some areas provide better habitat than 
others. Thus, taking a simple random sample is likely to result in a large estimated 
variance for the number of clams in an area. Russell (1972) used stratified random 
sampling to estimate the total number of bushels of hard shell clams (Mercenaria 
mercenaria) in Narragansett Bay, Rhode Island. The area of interest was divided 
into four strata based on preliminary surveys that identified areas in which clams 
were abundant. Then $n_h$ dredge tows were made in stratum $h$, for $h$ = 1, 2, 3, 4. 
The acreage for each stratum was known, and Russell calculated that the area fished 
during a standard dredge tow was 0.039 acres, so that we may use 
$N_h = 25.6 \times \text{Area}_h$. 

a Here are the results from the survey taken before the commercial season. Estimate 
the total number of bushels of clams in the area, and give the standard error of 
your estimate. 

          Area       Number of    Average Number of    Sample Variance
Stratum   (Acres)    Tows Made    Bushels per Tow      for Stratum
1         222.81     4            0.44                 0.068
2          49.61     6            1.17                 0.042
3          50.25     3            3.92                 2.146
4         197.81     5            1.80                 0.794


b Another survey was performed at the end of the commercial season. In this survey, 
strata 1, 2, and 3 were collapsed into a single stratum, called stratum 1 below. 
Estimate the total number of bushels of clams (with standard error) at the end of 
the season. 

          Area       Number of    Average Number of    Sample Variance
Stratum   (Acres)    Tows Made    Bushels per Tow      for Stratum
1         322.67     8            0.63                 0.083
4         197.81     5            0.40                 0.046


Lydersen and Ryg (1991) used stratification techniques to estimate ringed seal 
populations in Svalbard fjords. The 200 km² study area was divided into three zones: 
Zone 1, outer Sassenfjorden, was covered with relatively new ice during the study 
period in March, 1990, and had little snow cover; Zone 3, Tempelfjorden, had a stable 
ice cover throughout the year; Zone 2, inner Sassenfjorden, was intermediate between 
the stable Zone 3 and the unstable Zone 1. Ringed seals need good ice to establish 
territories with breathing holes, and snow cover enables females to dig out birth lairs. 
Thus, it was thought that the three zones would have different seal densities. The 
investigators took a stratified random sample of 20% of the 200 1-km² areas. The 
following table gives the number of plots, and the number of plots sampled, in each 
zone: 


         Number      Plots
Zone     of Plots    Sampled
1         68         17
2         84         12
3         48         11
Total    200         40


In each sampled area, Imjak the Siberian husky tracked seal structures by scent; the 
number of breathing holes in each sampled square was recorded. A total of 199 breathing 
holes were located in zones 1, 2, and 3 altogether. The data (reconstructed from 
information given in the paper) are in the file seals.dat. 


a Estimate the total number of breathing holes in the study region, along with its 
standard error. 

b If you were designing this survey, how would you allocate observations to strata 
if the goal is to estimate the total number of breathing holes? If the goal is to 
compare the density of breathing holes in the three zones? 


Proportional allocation was used in the stratified sample in Example 3.2. It was noted, 
however, that variability was much higher in the West than in the other regions. 
Using the estimated variances in Example 3.2, and assuming that the sampling costs 
are the same in each stratum, find an optimal allocation for a stratified sample of 
size 300. 


Select a stratified random sample of size 300 from the data in the file agpop.dat, using 
your allocation in Exercise 12. Estimate the total number of acres devoted to farming 
in the United States, and give the standard error of your estimate. How does this 
standard error compare with that found in Example 3.2? 


Burnard (1992) sent a questionnaire to a stratified sample of nursing tutors and students 
in Wales, to study what the tutors and students understood by the term experiential 
learning. The population size and sample size obtained for each of the four strata are 
given below: 


Stratum Population Size Sample Size 
General nursing tutors (GT) 150 109 
Psychiatric nursing tutors (PT) 34 26 
General nursing students (GS) 2680 222 
Psychiatric nursing students (PS) 570 40 
Total 3434 397 


Respondents were asked which of the following techniques could be identified as 
experiential learning methods; the number of students in each group who identified 
the method as an experiential learning method are given below: 


Method GS PS PT GT 
Role play 213 38 26 104 
Problem solving activities 182 33 22 95 
Simulations 95 20 22 64 
Empathy-building exercises 89 25 20 54 
Gestalt exercises 24 4 5 12 


Estimate the overall percentage of nursing students and tutors who identify each of 
these techniques as “experiential learning.” Be sure to give standard errors for your 
estimates. 


Hayes (2000) took a stratified sample of New York City food stores. The sampling 
frame consisted of 1408 food stores with at least 4000 square feet of retail space. The 
population of stores was stratified into three strata using median household income 
within the zip code. The prices of a “market basket” of goods were determined for 
each store; the goal of the survey was to investigate whether prices differ among the 
three strata. Hayes used the logarithm of total price for the basket as the response y. 
Results are given in the following table: 


Stratum, h           $N_h$    $n_h$    $\bar{y}_h$    $s_h$
1 Low income         190      21       3.925          0.037
2 Middle income      407      14       3.938          0.052
3 Upper income       811      22       3.942          0.070


a The planned sample size was 30 in each stratum; this was not achieved because 
some stores went out of business while the data were being collected. What are 
the advantages and disadvantages of sampling the same number of stores in each 
stratum? 


b Estimate $\bar{y}_U$ for these data and give a 95% CI. 


c Is there evidence that prices are different in the three strata? 


Kruuk et al. (1989) used a stratified sample to estimate the number of otter (Lutra lutra) 
dens along the 1400-km coastline of Shetland, UK. The coastline was divided into 
242 (237 that were not predominantly buildings) 5-km sections, and each section was 
assigned to the stratum whose terrain type predominated. Then sections were chosen 
randomly from the sections in each stratum. In each section chosen, the investigators 
counted the total number of dens in a 110-m-wide strip along the coast. The data are 
in file otters.dat. The population sizes for the strata are as follows: 


                          Total       Sections
Stratum                   Sections    Counted
1 Cliffs over 10 m        89          19
2 Agriculture             61          20
3 Not 1 or 2, peat        40          22
4 Not 1 or 2, non-peat    47          21


a Estimate the total number of otter dens in Shetland, along with a standard error 
for your estimate. 


b Discuss possible sources of bias in this study. Do you think it is possible to avoid 
all selection and measurement bias? 


Marriage and divorce statistics are compiled by the National Center for Health Statis- 
tics and published in volumes of Vital Statistics of the United States. State and local 
officials provide NCHS with annual counts of marriages and divorces in each county. 
In addition, some states send computer tapes of additional data, or microfilm copies 
of marriage or divorce certificates to NCHS. These additional data are used to cal- 
culate statistics about age at marriage or divorce, previous marital status of marrying 
couples, and children involved in divorce. In 1987, if a state sent a computer tape, 
all records were included in the divorce statistics; if a state sent microfilm copies, a 
specified fraction of the divorce certificates was randomly sampled and data recorded. 
The sampling rates (probabilities of selection) and number of records sampled in each 
state in the divorce registration area for 1987 are in file divorce.dat. 
a How many divorces were there in the divorce registration area in 1987? HINT: 
Construct and use the sampling weights. 


b Why did NCHS use different sampling rates in different states? 


c Estimate the total number of divorces granted to men aged 24 or less. To women 
aged 24 or less. Give 95% CIs for your estimates. 


d In what proportion of all divorces is the husband between 40 and 50 years old? 
In what proportion is the wife between 40 and 50? Give 95% CIs for your 
estimates. 
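The hint in part (a) rests on the fact that a sampling weight is the reciprocal of the selection probability, so each sampled record stands for 1/rate records in its state. A minimal sketch with three invented states follows; the names, rates, and counts are hypothetical and are not taken from divorce.dat.

```python
# Hypothetical (state, sampling rate, number of records sampled).
# A rate of 1.0 means every record was included.
states = [
    ("State A", 1.0, 12000),
    ("State B", 0.5, 4000),
    ("State C", 0.25, 1500),
]

# Sampling weight = 1 / probability of selection
weights = {name: 1 / rate for name, rate, _ in states}

# Each sampled record represents `weight` records in its state
estimated_total = sum(n * weights[name] for name, _, n in states)
print(weights)
print(estimated_total)
```

When the sampling rate is fixed in advance and the number of sampled records is known, the weighted count reproduces, up to rounding in the sampling, the number of records in each state.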


Wilk et al. (1977) reported data on the number and types of fishes and environmental 
data for the area of the Atlantic continental shelf between eastern Long Island, New 
York and Cape May, New Jersey. The ocean survey area was divided into strata based 
on depth. Sampling was done at a higher rate close to shore than farther away from 
shore: “In-shore strata (0-28 m) were sampled at a rate of approximately one station 
per 515 km² and off-shore strata (29-366 m) were sampled at a rate of approximately 
one station per 1,030 km²” (p. 1). Thus each record in strata 3-6 represents twice as 
much area as each record in strata 1 and 2. In calculating average numbers of fish 
caught and numbers of species, we may use a relative sampling weight of 1 for strata 
1 and 2, and weight 2 for strata 3-6. 


Stratum    Depth (m)    Relative Sampling Weight
1          0-19         1
2          20-28        1
3          29-55        2
4          56-100       2
5          111-183      2
6          184-366      2
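Relative weights are enough for averages because a weighted mean does not change when every weight is multiplied by the same constant. The sketch below checks this with four invented catch counts; the multiplier 515 is the in-shore area per station quoted above, and the function name is hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical catches at four stations: two in-shore, two off-shore
catch = [120, 80, 40, 30]
relative_w = [1, 1, 2, 2]                   # relative sampling weights
area_w = [515 * w for w in relative_w]      # km² represented by each station

mean_relative = weighted_mean(catch, relative_w)
mean_area = weighted_mean(catch, area_w)
print(mean_relative, mean_area)
```

The same shortcut does not work for totals: an estimated total scales with the weights, so it needs the full weights rather than the relative ones.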


The data file nybight.dat contains data on the total catch for sampling stations visited 
in June 1974 and June 1975. 


a Construct side-by-side boxplots of the number of fish caught in the trawls in June, 
1974. Does there appear to be a large variation among the strata? 


b Calculate estimates of the average number and average weight of fish caught per 
haul in June 1974, along with the standard error. 


c Calculate estimates of the average number and average weight of fish caught per 
haul in June 1975, along with the standard error. 


d Is there evidence that the average weight of fish caught per haul differs between 
June 1974 and June 1975? Answer using an appropriate hypothesis test. 


In January 1995, the Office of University Evaluation at Arizona State University 
surveyed faculty and staff members to find out their reaction to the closure of the 
university during Winter Break, 1994. Faculty and staff in academic units that 
were closed during the winter break were divided into four strata and 
subsampled. 
Stratum                               Population      Sample
Number     Employee Type              Size ($N_h$)    Size ($n_h$)
1          Faculty                    1374            500
2          Classified staff           1960            653
3          Administrative staff        252             74
4          Academic professional        95             95


Questionnaires were sent through campus mail to persons in strata 1 through 4; the 
sample size in the above table is the number of questionnaires mailed in each stratum. 
We'll come back to the issue of nonresponse in this survey in Chapter 8; for now, 
just analyze the respondents in the stratified sample of employees in closed units; the 
data are in the file “winter.dat.” For this exercise, look at the answers to the question 
“Would you want to have Winter Break Closure again?” (variable breakaga). 


a Not all persons in the survey responded to the question. Find the number of persons 
that responded to the question in each of the four strata. For this exercise, use these 
values as the $n_h$. 


b Use (3.6) and (3.7) to estimate the proportion of faculty and staff that would 
answer yes to the question “Would you want to have Winter Break Closure again” 
and give the standard error. 


c Create a new variable, in which persons who respond “yes” to the question take 
on the value 1, persons who respond “no” to the question take on the value 0, and 
persons who do not respond are either left blank (if you are using a spreadsheet) or 
assigned the missing value code (if you are using statistical software). Construct 
a column of sampling weights $N_h/n_h$ for the observations in the sample. (The 
sampling weight will be 0 or missing for nonrespondents.) Now use (3.10) to 
estimate the proportion of faculty and staff that would answer yes to the question 
“Would you want to have Winter Break Closure again?” 


d Using the column of 0s and 1s you constructed in the previous question, find $s_h^2$ 
for each stratum by calculating the sample variance of the observations in that 
stratum. Now use (3.5) to calculate the standard error of your estimate of the 
proportion. Why is your answer the same as you calculated in (b)? 


e Stratification is sometimes used as a method of dealing with nonresponse. 
Calculate the response rates (the number of persons who responded divided by 
the number of questionnaires mailed) for each stratum. Which stratum has the 
lowest response rate for this question? How does stratification treat the 
nonrespondents? 


The data in the file radon.dat were collected from 1003 homes in Minnesota in
1987 (Tate, 1988) in order to estimate the prevalence and distribution of households
with high indoor radon concentrations. The data are adapted from
www.stat.berkeley.edu/users/statlabs/labs.html, the website for Nolan and Speed (2000).
Since the investigators were interested in how radon levels varied across counties, each of the
87 counties in Minnesota served as a stratum. An SRS of telephone numbers from
county telephone directories was selected in each county. When a household could
not be contacted or was unwilling to participate in the study, an alternate
telephone number was used, until the desired sample size in the stratum was
reached.


a Discuss possible sources of nonsampling error in this survey.


b Calculate the sampling weight for each observation, using the values for $N_h$ and
$n_h$ in the data file.


c Treating the sample as a stratified random sample, estimate the average radon
level for Minnesota homes, along with a 95% CI. Do the same for the response
log(radon).


d Estimate the total number of Minnesota homes that have radon level of 4 picocuries
per liter (pCi/L) or higher, with a 95% CI. The U.S. Environmental Protection 
Agency (2007) recommends fixing your home if the radon level is at least 4 
pCi/L. 


C. Working with Theory 


Construct a small population and stratification for which $V(\hat{t}_{\mathrm{str}})$ using proportional
allocation is larger than the variance that would be obtained by taking an SRS with
the same number of observations. HINT: Use (3.11).


A stratified sample is being designed to estimate the prevalence p of a rare
characteristic, say the proportion of residents in Milwaukee, Wisconsin, who have Lyme
disease. Stratum 1, with $N_1$ units, has a high prevalence of the characteristic;
stratum 2, with $N_2$ units, has low prevalence. Assume that the cost to sample a unit (for
example, the cost to select a person for the sample and determine whether he or she
has Lyme disease) is the same for each stratum, and that at most 2000 units are to be
sampled.


a Let $p_1$ and $p_2$ be the proportions in stratum 1 and stratum 2 with the rare
characteristic. If $p_1 = 0.10$, $p_2 = 0.03$, and $N_1/N = 0.4$, what are $n_1$ and $n_2$ under
optimal allocation?

b If $p_1 = 0.10$, $p_2 = 0.03$, and $N_1/N = 0.4$, what is $V(\hat{p}_{\mathrm{str}})$ under proportional
allocation? Under optimal allocation? What is the variance if you take an SRS of
2000 units from the population?


c (Use a spreadsheet for this part of the exercise.) Now fix p = 0.05. Let $p_1$ range
from 0.05 to 0.50, and $N_1/N$ range from 0.01 to 0.50 (these two values then
determine the value of $p_2$). For each combination of $p_1$ and $N_1/N$, find the optimal
allocation, and the variance under both proportional allocation and optimal
allocation. Also find the variance from an SRS of 2000 units. When does the optimal
allocation give a substantial increase in precision when compared to proportional
allocation? When compared to an SRS?


(Requires probability.) We know from Section 2.3 that there are

$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$$

possible SRSs of size n from a population of size N. Suppose we stratify the
population into H strata, where each stratum contains $N_h = N/H$ units. A stratified sample
is to be selected using proportional allocation, so that $n_h = n/H$.


a How many possible stratified samples are there?

b Stirling’s formula approximates k!, when k is large, by $k! \approx \sqrt{2\pi k}\,(k/e)^k$, where
$e = \exp(1) \approx 2.718282$. Use Stirling’s formula to approximate

$$\frac{\text{number of possible stratified samples of size } n}{\text{number of possible SRSs of size } n}.$$


(Requires calculus.) Show that the variance is minimized for a fixed cost with the
cost function in (3.12) when $n_h \propto N_h S_h/\sqrt{c_h}$, as in (3.13). HINT: Use Lagrange
multipliers.


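Under Neyman allocation, discussed in Section 3.4.2, the optimal sample size in
stratum h is

$$n_{h,\mathrm{Neyman}} = n\,\frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}.$$


a Show that the variance of $\hat{t}_{\mathrm{str}}$ if Neyman allocation is used is

$$V_{\mathrm{Neyman}}(\hat{t}_{\mathrm{str}}) = \frac{1}{n}\left(\sum_{h=1}^{H} N_h S_h\right)^2 - \sum_{h=1}^{H} N_h S_h^2.$$


b We showed in Section 3.4.1 that the variance of $\hat{t}_{\mathrm{str}}$ if proportional allocation is
used is

$$V_{\mathrm{prop}}(\hat{t}_{\mathrm{str}}) = \frac{N}{n}\sum_{h=1}^{H} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2.$$

Prove that the theoretical variance from Neyman allocation is always less than or
equal to the theoretical variance from proportional allocation by showing that

$$V_{\mathrm{prop}}(\hat{t}_{\mathrm{str}}) - V_{\mathrm{Neyman}}(\hat{t}_{\mathrm{str}})
= \frac{N^2}{n}\left[\sum_{h=1}^{H} \frac{N_h}{N} S_h^2 - \left(\sum_{h=1}^{H} \frac{N_h}{N} S_h\right)^2\right]
= \frac{N^2}{n}\sum_{h=1}^{H} \frac{N_h}{N}\left(S_h - \sum_{l=1}^{H} \frac{N_l}{N} S_l\right)^2.$$


c From (b), we see that the gain in precision from using Neyman allocation relative
to using proportional allocation is higher if the stratum standard deviations $S_h$
vary widely. When H = 2, show that

$$V_{\mathrm{prop}}(\hat{t}_{\mathrm{str}}) - V_{\mathrm{Neyman}}(\hat{t}_{\mathrm{str}}) = \frac{N_1 N_2}{n}\,(S_1 - S_2)^2.$$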
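Before attempting the proofs, it can help to check the formulas numerically. The sketch below is only a sanity check, not a proof: it evaluates the two variance expressions from parts (a) and (b) for a hypothetical two-stratum population (the stratum sizes, standard deviations, and sample size are invented for illustration) and compares their difference with the H = 2 expression in part (c). Neyman sample sizes are treated as real numbers rather than rounded to integers.

```python
import math

def allocation_variances(N_h, S_h, n):
    """Return (V_prop, V_neyman) for the estimated total, using the
    closed-form expressions stated in parts (a) and (b)."""
    N = sum(N_h)
    sum_NS = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    sum_NS2 = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    v_prop = (N / n) * sum_NS2 - sum_NS2
    v_neyman = sum_NS ** 2 / n - sum_NS2
    return v_prop, v_neyman

# Hypothetical population: all values below are assumptions for illustration.
N_h = [400, 600]      # stratum sizes
S_h = [10.0, 2.0]     # stratum standard deviations
n = 100               # total sample size

v_prop, v_neyman = allocation_variances(N_h, S_h, n)
gap_formula = N_h[0] * N_h[1] / n * (S_h[0] - S_h[1]) ** 2  # part (c), H = 2

print("V_prop   =", v_prop)
print("V_Neyman =", v_neyman)
print("difference =", v_prop - v_neyman, " part (c) formula =", gap_formula)
```

Changing `S_h` so that the two standard deviations are equal makes the difference vanish, which matches the remark in part (c) that Neyman allocation helps most when the $S_h$ vary widely.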


(Requires calculus.) Suppose that there are K responses of interest, and response k has
relative importance $a_k > 0$, where $\sum_{k=1}^{K} a_k = 1$. Let $\hat{t}_{yk}$ be the estimated population
total for response k, and let $S_{kh}^2$ be the population variance for response k in stratum
h. Then the optimal allocation problem is to minimize

$$\sum_{k=1}^{K} a_k V(\hat{t}_{yk})$$

subject to the constraint $C = c_0 + \sum_{h=1}^{H} n_h c_h$. Show that the optimal allocation for
fixed total sample size n gives


sa I Si 


0 Eas 


D. Projects and Activities 


Rectangles. This activity continues Exercise 30 of Chapter 2. Divide the rectangles 
in the population of Figure 2.7 into two strata, based on your judgment of size. Now 
take a stratified sample of 10 rectangles. State how you decided on the sample size in 
each stratum. Estimate the total area of all the rectangles in the population, and give 
a 95% CI, based on your sample. How does your CI compare with that from the SRS 
in Chapter 2? 


Mutual funds. In Exercise 31 of Chapter 2, you took an SRS of funds from a mutual 
fund company. Most companies have mutual funds in a number of different categories, 
for example, domestic stock funds, foreign stock funds, and bond funds, and the 
returns in these categories differ. 


a Divide the funds from the company into strata. You may use categories provided
by the fund company, or other categories such as market capitalization. Create a 
table of the strata, with the number of mutual funds in each stratum. 


b Using proportional allocation, take a stratified random sample of size 25 from 
your population. 


c Find the mean and a 95% CI for the mean of the variable you studied in Exercise 31
of Chapter 2. How does your estimate from the stratified sample compare with 
the estimate from the SRS you found earlier? 


The Consumer Bankruptcy Project of 2001 (Warren and Tyagi, 2003; data collection 
is described on pages 181-188) surveyed 2220 households who filed for Chapter 7 
or Chapter 13 bankruptcy, with the goal of studying why families file for bankruptcy. 
Questionnaires were given to debtors attending the mandatory meeting with the 
bankruptcy trustee assigned to their case in the five districts selected by the investi- 
gators for the study (these districts included the cities of Nashville, Chicago, Dallas, 
Philadelphia, and Los Angeles) on specified target dates. Additional samples were 
taken from two rural districts in Tennessee and Iowa. Quota sampling was used in 
each district, with the goal of collecting 250 questionnaires from each district that 
had the same proportions of Chapter 7 and Chapter 13 bankruptcies as were filed in 
the district. Discuss the relative merits and disadvantages of using quota sampling for 
this study. 


Read the article on estimating medical errors by Thomas et al. (2000). What is the
purpose of this sample? How was stratification used in the survey design? Why do
you think the investigators chose the stratification variables they used? What are the
possible sources of nonsampling error in this survey?


TABLE 3.6
Table for Exercise 32.

County         A    B    C      D      E  Other   Total
Apache         1   13   19      0      0     94     127
Cochise       12    5    0    637     40      0     694
Coconino       1    6    0    125      0    289     421
Gila           0    2   51    151      0      0     204
Graham         0    2    0     63      0    143     208
Greenlee       0    0    0     58      0      0      58
Maricopa     118  169    0  3,732  2,675  5,105  11,799
Mohave         4    6    0     44      0    476     530
Navajo         2    5  132    124      0      0     263
Pima          62   26    0  1,097    727  1,786   3,698
Pinal          5   10   13     22    360    478     888
Santa Cruz     0    5    0    118    150      0     273
Yavapai        7    8    0    173      0    198     386
Yuma           5    5    0    837      0      0     847
La Paz         0    1    0     89      0      0      90
Total        217  263  215  7,270  3,952  8,569  20,486


The U.S. Monthly Retail Trade and Food Services program, described at
www.census.gov/mrts/www/mrts.html, provides estimates of sales at retail and food service stores.
Read the documentation on Sample Design and Estimation Procedures. How does the 
survey use stratification in the design? 


Suppose the Arizona Department of Health wishes to take a survey of 2-year-olds 
whose families receive medical assistance, to determine the proportion who have 
been immunized. The medical care is provided by several different health care orga- 
nizations, and the state has 15 counties. Table 3.6 shows the population number of 
2-year-olds for each county/organization combination. 

The sample is to be stratified by county and organization. It is desired to select 
sample sizes for each combination so that 


a the margin of error for estimating percentage immunized is 0.05 or less when the
data are tabulated for each county (summing over all health care organizations) 

b the margin of error for estimating percentage immunized is 0.05 or less when the 
data are tabulated for each health care organization (summing over all counties) 

c at least two children (fewer, of course, if the cell does not have two children) are 
selected from every cell. 


Note that for this problem, as for many survey designs, many different designs would 
be possible. 
Example 3.7 discussed the use of stratified sampling in mutual funds.


a Locate an index fund or exchange traded fund that attempts to replicate an index. 
Summarize how they use stratified sampling. 


b Suppose you were asked to devise a stratified sampling plan to represent the 
Wilshire 5000 Index using market capitalization classes as strata. Investigate how 
the index is constructed. Using a list of the stocks in the index, construct strata 
and develop a stratified sampling design. 


Trucks. The Vehicle Inventory and Use Survey (VIUS) has been conducted by the U.S. 
government to provide information on the number of private and commercial trucks in 
each state. The stratified random sampling design is described in U.S. Census Bureau 
(2006b). For the 2002 survey, 255 strata were formed from the sampling frame of 
truck registrations using stratification variables state and trucktype. The 50 states plus 
the District of Columbia formed 51 geographic classes; in each, the truck registrations 
were partitioned into one of five classes: 


1. Pickups
2. Minivans, other light vans, and sport utility vehicles
3. Light single-unit trucks with gross vehicle weight less than 26,000 pounds
4. Heavy single-unit trucks with gross vehicle weight greater than or equal to 26,000
   pounds
5. Truck-tractors


Consequently, the full data set has 51 x 5 = 255 strata. Selected variables from the 
data are in the data file vius.dat. For each question below, give a point estimate and a 
95% CI. 


a The sampling weights are found in variable tabtrucks and the stratification is given
by variable stratum. Estimate the total number of trucks in the United States. 
(HINT: What should your response variable be?) Why is the standard deviation of 
your estimator essentially zero? 


b Estimate the total number of truck miles driven in 2002 (variable miles_annl). 


c Estimate the total number of truck miles driven in each of the five trucktype
classes.


d Estimate the average miles per gallon (MPG) for the trucks in the population.


Baseball data. Exercise 32 of Chapter 2 described the population of baseball players 
in data file baseball.dat. 


a Take a stratified random sample of 150 players from the file, using proportional
allocation with the different teams as strata. Describe how you selected the sample.


b Find the mean of the variable logsal, using your stratified sample, and give a
95% CI. 


c Estimate the proportion of players in the data set who are pitchers, and give a 
95% CI. 


d How do your estimates compare with those of Exercise 32 from Chapter 2? 


e Examine the sample variances in each stratum. Do you think optimal allocation 
would be worthwhile for this problem? 
f Using the sample variances from (e) to estimate the population stratum variances,
determine the optimal allocation for a sample in which the cost is the same in each 
stratum and the total sample size is 150. How much does the optimal allocation 
differ from the proportional allocation? 


Online bookstore. In Exercise 33 from Chapter 2 you took an SRS of book titles from 
amazon.com. Use the same book genre for this problem. 


a Stratify the population into two categories: hardcover and paperback. You can
obtain the population counts in the paperback category by refining your search to
include the word paperback.


b Take a stratified random sample of 40 books from your population using
proportional allocation. Record the price and number of pages for each book.


c Give a point estimate and a 95% CI for the mean price of books and the mean
number of pages for books in the population.


d Compare your CIs to those from Exercise 33 of Chapter 2. Does stratification
appear to increase the precision of your estimate?


e Use your SRS from Chapter 2 to estimate the within-stratum variance of book
price for each stratum. In this case, you are using the SRS as a pilot sample to help
design a subsequent sample. Find the optimal allocation for a stratified random
sample of 40 books. How does the optimal allocation differ from the proportional
allocation?


IPUMS exercises. Exercise 37 of Chapter 2 described the IPUMS data. 


a Using one or more of the following variables: age, sex, race, or marstat, divide the
population into strata. Explain how you decided upon your stratification variable
and how you chose the number of strata to use. (Note: It is NOT FAIR to use the
values of inctot in the population to choose your strata! However, you may draw
a pilot sample of size 200 using an SRS to aid you in constructing your strata.)


b Using the strata you constructed, draw a stratified random sample using
proportional allocation. Use the same overall sample size you used for your SRS in
Exercise 37 of Chapter 2. Explain how you calculated the sample size to be drawn
from each stratum.


c Using the stratified sample you selected with proportional allocation, estimate the
total income for the population, along with a 95% CI.


d Using the pilot sample of size 200 to estimate the within-stratum variances, use
optimal allocation to determine sample stratum sizes. Use the same value of n as
in part 37b, which is the same n from the SRS in Exercise 37 of Chapter 2. Draw
a stratified random sample from the population and estimate the total income,
along with a 95% CI.


e Under what conditions can optimal allocation be expected to perform much better
than proportional allocation? Do these conditions exist for this population?
Comment on the relative performance you observed between these two allocations.


f Overall, do you think your stratification was worthwhile for sampling from this
population? How did your stratified estimates compare with the estimate from the
SRS you took in Chapter 2? If you were to start over on the stratification, what
would you do differently?
Ratio and Regression Estimation


The registers of births, which are kept with care in order to assure the condition of the citizens, can 
serve to determine the population of a great empire without resorting to a census of its inhabitants, an 
operation which is laborious and difficult to do with exactness. But for this it is necessary to know the 
ratio of the population to the annual births. The most precise means for this consists of, first, choosing 
subdivisions in the empire that are distributed in a nearly equal manner on its whole surface so as to 
render the general result independent of local circumstances; second, carefully enumerating the 
inhabitants of several communes in each of the subdivisions, for a specified time period; third, 
determining the corresponding mean number of annual births, by using the accounts of births during 
several years preceding and following this time period. This number, divided by that of the inhabitants, 
will give the ratio of the annual births to the population, in a manner that will be more reliable as the 
enumeration becomes larger. 


—Pierre-Simon Laplace, Essai Philosophique sur les Probabilités (trans. S. Lohr) 


France had no population census in 1802, and Laplace wanted to estimate the number 
of persons living there (Laplace, 1814; Cochran, 1978). He obtained a sample of 30 
communes spread throughout the country. These communes had a total of 2,037,615 
inhabitants on September 23, 1802. In the 3 years preceding September 23, 1802, a 
total of 215,599 births were registered in the 30 communes. Laplace determined the 
annual number of registered births in the 30 communes to be 215,599/3 = 71,866.33. 
Dividing 2,037,615 by 71,866.33, Laplace estimated that each year there was one 
registered birth for every 28.352845 persons. Reasoning that communes with large 
populations are also likely to have large numbers of registered births, and judging 
that the ratio of population to annual births in his sample would likely be similar to 
that throughout France, he concluded that one could estimate the total population of 
France by multiplying the total number of annual births in all of France by 28.352845. 
(For some reason, Laplace decided not to use the actual number of registered births in 
France in the year prior to September 22, 1802 in his calculation but instead multiplied 
the ratio by 1 million.)

Laplace was not interested in the total number of registered births for its own 
sake but used it as auxiliary information for estimating the total population of France. 
We often have auxiliary information in surveys. In Chapter 3, we used such auxiliary 
information in designing a survey. In this chapter, we use auxiliary information in the
estimators. Ratio and regression estimation use variables that are correlated with the 
variable of interest to improve the precision of estimators of the mean and total of a 
population. 


4.1 Ratio Estimation in a Simple Random Sample


For ratio estimation to apply, two quantities $y_i$ and $x_i$ must be measured on each
sample unit; $x_i$ is often called an auxiliary variable or subsidiary variable. In the
population of size N,

$$t_y = \sum_{i=1}^{N} y_i, \qquad t_x = \sum_{i=1}^{N} x_i,$$

and their ratio¹ is

$$B = \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}.$$

In the simplest use of ratio estimation, a simple random sample (SRS) of size n is
taken, and the information in both x and y is used to estimate B, $t_y$, or $\bar{y}_U$.

Ratio and regression estimation both take advantage of the correlation of x and y in
the population; the higher the correlation, the better they work. Define the population
correlation coefficient of x and y to be

$$R = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y}. \tag{4.1}$$

Here, $S_x$ is the population standard deviation of the $x_i$’s, $S_y$ is the population standard
deviation of the $y_i$’s, and R is simply the Pearson correlation coefficient of x and y for
the N units in the population.


EXAMPLE 4.1

Suppose the population consists of agricultural fields of different sizes. Let

    $y_i$ = bushels of grain harvested in field i
    $x_i$ = acreage of field i.

Then

    B = average yield in bushels per acre
    $\bar{y}_U$ = average yield in bushels per field
    $t_y$ = total yield in bushels.


¹Why use the letter B to represent the ratio? As we shall see in Section 4.6, ratio estimation is motivated
by a regression model: $Y_i = \beta x_i + \varepsilon_i$, with $E[\varepsilon_i] = 0$ and $V[\varepsilon_i] = \sigma^2 x_i$. Thus the ratio of $t_y$ and $t_x$ is
actually a regression coefficient.
If an SRS is taken, natural estimators for B, $t_y$, and $\bar{y}_U$ are:

$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x}, \qquad
\hat{t}_{yr} = \hat{B}\,t_x, \qquad
\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U, \tag{4.2}$$

where $t_x$ and $\bar{x}_U$ are assumed known.


4.1.1 Why Use Ratio Estimation?


1 Sometimes we simply want to estimate a ratio. In Example 4.1, B, the average
yield per acre, is of interest and is estimated by the ratio of the sample means
$\hat{B} = \bar{y}/\bar{x}$. If the fields differ in size, both numerator and denominator are random
quantities; if a different sample is selected, both $\bar{y}$ and $\bar{x}$ are likely to change. In
other survey situations, ratios of interest might be the ratio of liabilities to assets,
the ratio of the number of fish caught to the number of hours spent fishing, or the
per capita income of household members in Australia.

Some ratio estimates appear disguised because the denominator looks like it is 
just a regular sample size. To determine whether you need to use ratio estimation 
for a quantity, ask yourself “If I took a different sample, would the denominator be 
a different number?” If yes, then you are using ratio estimation. Suppose you are 
interested in the percentage of pages in Good Housekeeping magazine that contain 
at least one advertisement. You might take an SRS of 10 issues from the most recent 
60 issues of the magazine and for each issue measure the following: 


    $x_i$ = total number of pages in issue i
    $y_i$ = total number of pages in issue i that contain at least one advertisement.


The proportion of interest can be estimated as

$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\hat{t}_y}{\hat{t}_x}.$$

The denominator is the estimated total number of pages in the 60 issues. If a 
different sample of 10 issues is selected, the denominator will likely be different. 
In this example, we have an SRS of magazine issues; we have a cluster sample 
(we briefly discussed cluster samples in Section 2.1) of pages from the most recent 
60 issues of Good Housekeeping. In Chapter 5, we shall see that ratio estimation 
is commonly used to estimate means in cluster sampling. 

Technically, we are using ratio estimation every time we take an SRS and esti- 
mate a mean or proportion for a subpopulation, as will be discussed in Section 4.2. 


2 Sometimes we want to estimate a population total, but the population size N is
unknown. Then we cannot use the estimator $\hat{t}_y = N\bar{y}$ from Chapter 2. But we know
that $N = t_x/\bar{x}_U$ and can estimate N by $t_x/\bar{x}$. We thus use another measure of size,
$t_x$, instead of the population count N.
To estimate the total number of fish in a haul that are longer than 12 cm,
you could take a random sample of fish, estimate the proportion that are larger
than 12 cm, and multiply that proportion by the total number of fish, N. Such a
procedure cannot be used if N is unknown. You can, however, weigh the total haul
of fish, and use the fact that having a length of more than 12 cm (y) is related to
weight (x), so

$$\hat{t}_{yr} = \bar{y}\,\frac{t_x}{\bar{x}}.$$

The total weight of the haul, $t_x$, is easily measured, and $t_x/\bar{x}$ estimates the total
number of fish in the haul.


3 Ratio estimation is often used to increase the precision of estimated means and
totals. Laplace used ratio estimation for this purpose in the example at the beginning
of this chapter, and increasing precision will be the main use discussed in the
chapter.

In Laplace’s use of ratio estimation,

    $y_i$ = number of persons in commune i
    $x_i$ = number of registered births in commune i.

Laplace could have estimated the total population of France by multiplying the
average number of persons in the 30 communes ($\bar{y}$) by the total number of communes
in France (N). He reasoned that the ratio estimator would attain more precision:
on average, the larger the population of a commune, the higher the number of
registered births. Thus the population correlation coefficient R, defined in (4.1), is
likely to be positive. Since $\bar{y}$ and $\bar{x}$ are then also positively correlated [see (A.11)
in Appendix A], the sampling distribution of $\bar{y}/\bar{x}$ will have less variability than the
sampling distribution of $\bar{y}/\bar{x}_U$. So if

    $t_x$ = total number of registered births

is known, the mean squared error (MSE) of $\hat{t}_{yr} = \hat{B}\,t_x$ is likely to be smaller than the
MSE of $N\bar{y}$, an estimator that does not use the auxiliary information of registered
births.


4 Ratio estimation is used to adjust estimates from the sample so that they reflect
demographic totals. An SRS of 400 students taken at a university with 4000 students
may contain 240 women and 160 men, with 84 of the sampled women and 40 of the
sampled men planning to follow careers in teaching. Using only the information
from the SRS, you would estimate that

$$\frac{4000}{400} \times 124 = 1240$$

students plan to be teachers. Knowing that the college has 2700 women and 1300
men, a better estimate of the number of students planning teaching careers might
be

$$\frac{84}{240} \times 2700 + \frac{40}{160} \times 1300 = 1270.$$


Ratio estimation is used within each gender: In the sample, 60% are women, but
67.5% of the population are women, so we adjust the estimate of the total number
of students planning a career in teaching accordingly. To estimate the total number
of women who plan to follow a career in teaching, let

$$y_i = \begin{cases} 1 & \text{if woman and plans career in teaching} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_i = \begin{cases} 1 & \text{if woman} \\ 0 & \text{otherwise.} \end{cases}$$

Then $(84/240) \times 2700 = (\bar{y}/\bar{x})\,t_x$ is a ratio estimate of the total number of women
planning a career in teaching. Similarly, $(40/160) \times 1300$ is a ratio estimate of the
total number of men planning a teaching career.

This use of ratio estimation, called poststratification, will be discussed in
Section 4.4 and Chapters 7 and 8.


5 Ratio estimation may be used to adjust for nonresponse, as will be discussed in
Chapter 8. Suppose a sample of businesses is taken; let $y_i$ be the amount spent on
health insurance by business i and $x_i$ be the number of employees in business i.
Assume that $t_x$, the total number of employees in all businesses in the population,
is known. We expect that the amount a business spends on health insurance will
be related to the number of employees. Some businesses may not respond to the
survey, however. One method of adjusting for the nonresponse when estimating
total insurance expenditures is to multiply the ratio $\bar{y}/\bar{x}$ (using data only from the
respondents) by the population total $t_x$. If companies with few employees are less
likely to respond to the survey, and if $y_i$ is proportional to $x_i$, then we would expect
the estimate $N\bar{y}$ to overestimate the population total $t_y$. In the ratio estimate $t_x\bar{y}/\bar{x}$,
$t_x/\bar{x}$ is likely to be smaller than N because companies with many employees are
more likely to respond to the survey. Thus a ratio estimate of total health care
insurance expenditures may help to compensate for the nonresponse of companies
with few employees.


EXAMPLE 4.2


Let’s return to the SRS from the U.S. Census of Agriculture, described in Example 2.5. 
The file agsrs.dat contains data from an SRS of 300 of the 3078 counties. 

For this example, suppose we know the population totals for 1987, but only have 
1992 information on the SRS of 300 counties. When the same quantity is measured 
at different times, the response of interest at an earlier time often makes an excellent 
auxiliary variable. Let 


    $y_i$ = total acreage of farms in county i in 1992
    $x_i$ = total acreage of farms in county i in 1987.


In 1987 a total of $t_x = 964{,}470{,}625$ acres were devoted to farms in the United States.
The average acreage per county for the population is then $\bar{x}_U = 964{,}470{,}625/3078 =
313{,}343.3$ acres of farms per county. The data, and the line through the origin with
slope $\hat{B}$, are plotted in Figure 4.1.

A portion of a spreadsheet with the 300 values of $x_i$ and $y_i$ is given in Table 4.1.
Cells C305 and D305 contain the sample averages of y and x, respectively, so

$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{\text{C305}}{\text{D305}} = 0.986565,$$

$$\hat{\bar{y}}_r = \hat{B}\,\bar{x}_U = (\hat{B})(313{,}343.283) = 309{,}133.6,$$

and

$$\hat{t}_{yr} = \hat{B}\,t_x = (\hat{B})(964{,}470{,}625) = 951{,}513{,}191.$$


FIGURE 4.1

The plot of acreage, 1992 vs. 1987, for an SRS of 300 counties. The line in the plot goes through
the origin and has slope $\hat{B} = 0.9866$. Note that the variability about the line increases with x.
[Horizontal axis: Millions of Acres Devoted to Farms (1987).]


Note that $\bar{y}$ for these data is 297,897.0, so $\hat{t}_{y,\mathrm{SRS}} = (3078)(\bar{y}) = 916{,}927{,}110$. In
this example, $\bar{x}_S = 301{,}953.7$ is smaller than $\bar{x}_U = 313{,}343.3$. This means that
our SRS of size 300 slightly underestimates the true population mean of the x’s.
Since the x’s and y’s are positively correlated, we have reason to believe that $\bar{y}_S$ may
also underestimate the population value $\bar{y}_U$. Ratio estimation gives a more precise
estimate of $\bar{y}_U$ by expanding $\bar{y}_S$ by the factor $\bar{x}_U/\bar{x}_S$. Figure 4.2 shows the ratio and
SRS estimates of $\bar{y}_U$ on a graph of the center part of the data. ■
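The arithmetic in Example 4.2 can be reproduced from the summary values alone. The short Python sketch below is not the book's SAS code; it simply plugs in the sample averages from Table 4.1 and the known 1987 total quoted in the example. The variable names are ours.

```python
# Summary values quoted in Example 4.2 and Table 4.1.
N = 3078                  # number of counties in the population
t_x = 964_470_625         # known 1987 total acreage
y_bar = 297_897.0467      # sample mean of acres92 (cell C305)
x_bar = 301_953.7233      # sample mean of acres87 (cell D305)

x_bar_U = t_x / N                 # population mean of x
B_hat = y_bar / x_bar             # estimated ratio
y_bar_ratio = B_hat * x_bar_U     # ratio estimate of the population mean of y
t_y_ratio = B_hat * t_x           # ratio estimate of the population total of y
t_y_srs = N * y_bar               # SRS estimate that ignores x

print(f"B_hat           = {B_hat:.6f}")
print(f"ratio mean est. = {y_bar_ratio:,.1f}")
print(f"ratio total est = {t_y_ratio:,.0f}")
print(f"SRS total est.  = {t_y_srs:,.0f}")
```

Because the sample mean of x falls below the known population mean, the ratio estimate of the total is larger than the SRS estimate, as the example describes.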


4.1.2 Bias and Mean Squared Error of Ratio Estimators


Unlike the estimators $\bar{y}$ and $N\bar{y}$ in an SRS, ratio estimators are usually biased for
estimating $\bar{y}_U$ and $t_y$. We start with the unbiased estimator $\bar{y}$: if we calculate $\bar{y}_S$ for
each possible SRS S, then the average of all of the sample means from the possible
samples is the population mean $\bar{y}_U$. The estimation bias in ratio estimation arises
because $\bar{y}$ is multiplied by $\bar{x}_U/\bar{x}$; if we calculate $\hat{\bar{y}}_r$ for all possible SRSs S, then the
average of all the values of $\hat{\bar{y}}_r$ from the different samples will be close to $\bar{y}_U$, but will
usually not equal $\bar{y}_U$ exactly.


TABLE 4.1
Part of the Spreadsheet for the Census of Agriculture Data

       A                           B       C              D              E
   1   County                      State   acres92 (y)    acres87 (x)    Residual ($y - \hat{B}x$)
   2
   3   Coffee County               AL      175209         179311         −1693.00
   4   Colbert County              AL      138135         145104         −5019.56
   5   Lamar County                AL      56102          59861          −2954.78
   6   Marengo County              AL      199117         220526         −18446.29
 299   Rock County                 WI      343115         357751         −9829.70
 300   Kanawha County              WV      19956          21369          −1125.91
 301   Pleasants County            WV      15650          15716          145.14
 302   Putnam County               WV      55827          55635          939.44
 303
 304   Column sum                          89369114       90586117       3.96176E-09
 305   Column average                      297897.0467    301953.7233
 306   Column standard deviation           344551.8948    344829.5964    31657.21817
 307   $\hat{B}$ = C305/D305 =             0.986565237


FIGURE 4.2

Detail of the center portion of Figure 4.1. Here, $\bar{x}_U$ is larger than $\bar{x}_S$, so $\hat{\bar{y}}_r$ is larger than $\bar{y}_S$.

The reduced variance of the ratio estimator usually compensates for the presence
of bias: although $E[\hat{\bar{y}}_r] \ne \bar{y}_U$, the value of $\hat{\bar{y}}_r$ for any individual sample is likely to
be closer to $\bar{y}_U$ than is the sample mean $\bar{y}_S$. After all, we take only one sample in
practice; most people would prefer to be able to say that their particular estimate from
the sample is likely to be close to the true value, rather than that their particular value
of $\bar{y}_S$ may be quite far from $\bar{y}_U$, but that the average deviation $\bar{y}_S - \bar{y}_U$, averaged
over all possible samples S that could be obtained, is zero. For large samples, the
sampling distributions of both $\bar{y}$ and $\hat{\bar{y}}_r$ will be approximately normal; if x and y
are highly positively correlated, the following pictures illustrate the relative bias and
variance of the two estimators:


[Two panels: Sampling Distribution of $\bar{y}$; Sampling Distribution of $\hat{\bar{y}}_r$]


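To calculate the bias of the ratio estimator of $\bar{y}_U$, note that

$$\hat{\bar{y}}_r - \bar{y}_U = \frac{\bar{y}}{\bar{x}}\,\bar{x}_U - \bar{y}_U = \bar{y}\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right) - \bar{y}_U.$$

Since $E[\bar{y}] = \bar{y}_U$,

$$\begin{aligned}
\text{Bias}(\hat{\bar{y}}_r) &= E[\hat{\bar{y}}_r - \bar{y}_U] \\
&= E[\bar{y}] - \bar{y}_U - E\left[\frac{\bar{y}}{\bar{x}}(\bar{x} - \bar{x}_U)\right] \\
&= -E[\hat{B}(\bar{x} - \bar{x}_U)] \\
&= -\text{Cov}(\hat{B}, \bar{x}).
\end{aligned} \tag{4.3}$$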


Consequently, as shown by Hartley and Ross (1954),

$$\frac{|\text{Bias}(\hat{\bar{y}}_r)|}{\sqrt{V(\hat{\bar{y}}_r)}} = \frac{|\text{Cov}(\hat{B}, \bar{x})|}{\bar{x}_U\sqrt{V(\hat{B})}} = \frac{|\text{Corr}(\hat{B}, \bar{x})|\sqrt{V(\hat{B})V(\bar{x})}}{\bar{x}_U\sqrt{V(\hat{B})}} \le \frac{\sqrt{V(\bar{x})}}{\bar{x}_U} = \text{CV}(\bar{x}). \tag{4.4}$$

The absolute value of the bias of the ratio estimator is small relative to the standard
deviation of the estimator if the coefficient of variation (CV) of $\bar{x}$ is small. For an
SRS, $\text{CV}(\bar{x}) \le S_x/(\sqrt{n}\,\bar{x}_U)$, so that $\text{CV}(\bar{x})$ decreases as the sample size increases.

The result in Equation (4.3) is exact, but not necessarily easy to use with data. We
now find approximations for the bias and variance of the ratio estimator that rely on
the variances and covariance of $\bar{x}$ and $\bar{y}$. These approximations are an example of the
linearization approach to approximating variances, to be discussed in Section 9.1. We
write


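$$\hat{\bar{y}}_r - \bar{y}_U = \frac{\bar{x}_U(\bar{y} - B\bar{x})}{\bar{x}} = (\bar{y} - B\bar{x})\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right). \tag{4.5}$$

We can then show that (see Exercise 22)

$$\text{Bias}[\hat{\bar{y}}_r] = E[\hat{\bar{y}}_r - \bar{y}_U] \approx \frac{1}{\bar{x}_U}\left[B\,V(\bar{x}) - \text{Cov}(\bar{x}, \bar{y})\right] = \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{x}_U}\left(BS_x^2 - RS_xS_y\right), \tag{4.6}$$

with $R$ the correlation between $x$ and $y$. The bias of $\hat{\bar{y}}_r$ is thus small if:

■ the sample size $n$ is large
■ the sampling fraction $n/N$ is large
■ $\bar{x}_U$ is large
■ $S_x$ is small
■ the correlation $R$ is close to 1.

Note that if all $x$'s are the same value ($S_x = 0$), then the ratio estimator is the same as
the SRS estimator $\bar{y}$ and the bias is zero.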
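For estimating the MSE of $\hat{\bar{y}}_r$, (4.5) gives:

$$\begin{aligned}
E[(\hat{\bar{y}}_r - \bar{y}_U)^2] &= E\left[\left\{(\bar{y} - B\bar{x})\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right)\right\}^2\right] \\
&= E\left[(\bar{y} - B\bar{x})^2 + (\bar{y} - B\bar{x})^2\left\{\left(\frac{\bar{x} - \bar{x}_U}{\bar{x}}\right)^2 - 2\,\frac{\bar{x} - \bar{x}_U}{\bar{x}}\right\}\right].
\end{aligned}$$

It can be shown (David and Sukhatme, 1974) that the second term is generally small
compared with the first term, so the variance and MSE are approximated by

$$\text{MSE}(\hat{\bar{y}}_r) = E[(\hat{\bar{y}}_r - \bar{y}_U)^2] \approx E[(\bar{y} - B\bar{x})^2]. \tag{4.7}$$

The term $E[(\bar{y} - B\bar{x})^2]$ can also be written as

$$E[(\bar{y} - B\bar{x})^2] = V\left[\frac{1}{n}\sum_{i \in S}(y_i - Bx_i)\right] = \left(1 - \frac{n}{N}\right)\frac{S_y^2 - 2BRS_xS_y + B^2S_x^2}{n}. \tag{4.8}$$

(See Exercise 18.)

From (4.7) and (4.8), the approximated MSE of $\hat{\bar{y}}_r$ will be small when

■ the sample size $n$ is large
■ the sampling fraction $n/N$ is large
■ the deviations $y_i - Bx_i$ are small
■ the correlation $R$ is close to +1.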


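In large samples, the bias of $\hat{\bar{y}}_r$ is typically small relative to $V(\hat{\bar{y}}_r)$, so that $\text{MSE}(\hat{\bar{y}}_r) \approx
V(\hat{\bar{y}}_r)$ (see Exercise 21). Thus, in the following, we use $\hat{V}(\hat{\bar{y}}_r)$ to estimate both the
variance and the MSE.

Note from (4.8) that $E[(\bar{y} - B\bar{x})^2] = V(\bar{d})$, where $d_i = y_i - Bx_i$ and $\bar{d}_U = 0$. Since
the deviations $d_i$ depend on the unknown value $B$, define the new
variable

$$e_i = y_i - \hat{B}x_i,$$

which is the $i$th residual from fitting the line $y = \hat{B}x$. Estimate $V(\hat{\bar{y}}_r)$ by

$$\hat{V}(\hat{\bar{y}}_r) = \left(1 - \frac{n}{N}\right)\left(\frac{\bar{x}_U}{\bar{x}}\right)^2\frac{s_e^2}{n}, \tag{4.9}$$

where $s_e^2$ is the sample variance of the residuals $e_i$:

$$s_e^2 = \frac{1}{n - 1}\sum_{i \in S} e_i^2.$$

[Exercise 19 explains why we include the factor $\bar{x}_U/\bar{x}$ in (4.9). In large samples, we
expect $\bar{x}_U/\bar{x}$ to be approximately equal to 1.] Similarly,

$$\hat{V}(\hat{B}) = \left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\bar{x}^2} \tag{4.10}$$

and

$$\hat{V}(\hat{t}_{yr}) = \hat{V}(t_x\hat{B}) = \left(1 - \frac{n}{N}\right)\left(\frac{t_x}{\bar{x}}\right)^2\frac{s_e^2}{n}. \tag{4.11}$$

If the sample sizes are sufficiently large, approximate 95% confidence intervals
(CIs) can be constructed using the standard errors (SEs) as

$$\hat{B} \pm 1.96\,\text{SE}(\hat{B}), \qquad \hat{\bar{y}}_r \pm 1.96\,\text{SE}(\hat{\bar{y}}_r), \qquad \text{or} \qquad \hat{t}_{yr} \pm 1.96\,\text{SE}(\hat{t}_{yr}).$$

Some software packages, including SAS software, substitute a $t$ percentile with $n - 1$
degrees of freedom for the normal percentile 1.96.

EXAMPLE 4.3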


Let’s return to the sample taken from the Census of Agriculture in Example 4.2. In the
spreadsheet in Table 4.1, we created Column E, containing the residuals $e_i = y_i - \hat{B}x_i$.
The sample standard deviation of Column E was calculated in Cell E306 to be $s_e =
31{,}657.218$. Thus, using (4.11),

$$\text{SE}(\hat{t}_{yr}) = 3078\sqrt{1 - \frac{300}{3078}}\left(\frac{313{,}343.283}{301{,}953.723}\right)\frac{31{,}657.218}{\sqrt{300}} = 5{,}546{,}162.$$

An approximate 95% CI for the total farm acreage, using the ratio estimator, is

$$951{,}513{,}191 \pm 1.96(5{,}546{,}162) = [940{,}642{,}713,\ 962{,}383{,}669].$$
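
The hand calculation above can be checked with a few lines of code. The following Python sketch is not the SAS program from the book's website; the function name `ratio_total_estimate` and its argument names are illustrative assumptions. It applies (4.11) to the summary statistics quoted in Table 4.1 and in this example.

```python
import math

def ratio_total_estimate(N, n, t_x, xbar, ybar, s_e, z=1.96):
    """Ratio estimate of a population total, its SE from (4.11), and a normal-theory CI.

    s_e is the sample standard deviation of the residuals e_i = y_i - B_hat * x_i.
    """
    b_hat = ybar / xbar                      # estimated ratio B_hat
    t_yr = b_hat * t_x                       # ratio estimate of the total
    fpc = 1 - n / N                          # finite population correction
    se = math.sqrt(fpc) * (t_x / xbar) * s_e / math.sqrt(n)
    return t_yr, se, (t_yr - z * se, t_yr + z * se)

# Summary statistics from Table 4.1 and Example 4.3
t_yr, se, ci = ratio_total_estimate(
    N=3078, n=300, t_x=964_470_625,
    xbar=301_953.7233, ybar=297_897.0467, s_e=31_657.218,
)
print(f"t_yr = {t_yr:,.0f}, SE = {se:,.0f}")
print(f"95% CI = [{ci[0]:,.0f}, {ci[1]:,.0f}]")
```

Passing a $t$ percentile as `z` (for example 1.96793 for 299 degrees of freedom) gives the slightly wider interval that survey software typically reports.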


EXAMPLE 4.4

The website gives SAS code for calculating the ratio $\hat{B} = \bar{y}/\bar{x}$ and its standard
error, with the output:

                      Ratio Analysis: acres92/acres87

Numerator   Denominator      Ratio    Std Err      95% CL for Ratio
acres92     acres87       0.986565   0.005750   0.97524871  0.99788176

We then multiply each quantity in the output by $t_x = 964{,}470{,}625$ (we do the
calculations on the computer to avoid roundoff error) to obtain $\hat{t}_{yr} = (964{,}470{,}625)
(0.986565237) = 951{,}513{,}191$ and a 95% CI for $t_y$ of [940,598,734, 962,427,648]. SAS
PROC SURVEYMEANS uses the percentile from a $t_{299}$ distribution, 1.96793, instead
of the value 1.96 from the normal distribution, so the CI from SAS software is slightly
larger than the one we obtained when doing calculations by hand.

Did using a ratio estimator for the population total improve the precision in this
example? The standard error of $\hat{t}_y = N\bar{y}$ is more than 10 times as large:

$$\text{SE}(N\bar{y}) = 3078\sqrt{1 - \frac{300}{3078}}\;\frac{s_y}{\sqrt{300}} = 58{,}169{,}381.$$

The estimated CV for the ratio estimator is 5,546,162/951,513,191 = 0.0058, as compared
with an estimated CV of 0.0634 for the SRS estimator $N\bar{y}$, which does not use
the auxiliary information. Including the 1987 information through the ratio estimator
has greatly increased the precision. If all quantities to be estimated were highly correlated
with the 1987 acreage, we could dramatically reduce the sample size and still
obtain high precision by using ratio estimators rather than $N\bar{y}$. ■
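
The precision comparison can be reproduced directly from the summary statistics. This is a minimal Python sketch, assuming the column average and column standard deviation of acres92 from Table 4.1 and the ratio-estimator results quoted above; the variable names are illustrative only.

```python
import math

N, n = 3078, 300
ybar, s_y = 297_897.0467, 344_551.8948     # sample mean and SD of acres92 (Table 4.1)
t_yr, se_ratio = 951_513_191, 5_546_162    # ratio-estimator results from Example 4.3

t_srs = N * ybar                                         # SRS estimate N * ybar
se_srs = N * math.sqrt(1 - n / N) * s_y / math.sqrt(n)   # SE of N * ybar

cv_srs = se_srs / t_srs
cv_ratio = se_ratio / t_yr
print(f"SE(N ybar) = {se_srs:,.0f}")
print(f"CV SRS = {cv_srs:.4f}, CV ratio = {cv_ratio:.4f}")
print(f"SE ratio of the two estimators = {se_srs / se_ratio:.1f}")
```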


Let’s take another look at the hypothetical population used in Example 2.2 and in
Exercise 2 of Chapter 3. Now, though, instead of using x as a stratification variable
in stratified sampling, we use it as auxiliary information for ratio estimation. The
population values are the following:

Unit Number    x    y
     1         4    1
     2         5    2
     3         5    4
     4         6    4
     5         8    7
     6         7    7
     7         7    7
     8         5    8

TABLE 4.2
Sampling Distribution for $\hat{t}_{yr}$.

Sample Number   Sample, S    $\bar{x}_S$   $\bar{y}_S$   $\hat{B}$   $\hat{t}_{\text{SRS}}$   $\hat{t}_{yr}$
      1         {1,2,3,4}    5.00    2.75    0.55    22.00    25.85
      2         {1,2,3,5}    5.50    3.50    0.64    28.00    29.91
      3         {1,2,3,6}    5.25    3.50    0.67    28.00    31.33
      4         {1,2,3,7}    5.25    3.50    0.67    28.00    31.33
      ⋮
     67         {4,5,6,8}    6.50    6.50    1.00    52.00    47.00
     68         {4,5,7,8}    6.50    6.50    1.00    52.00    47.00
     69         {4,6,7,8}    6.25    6.50    1.04    52.00    48.88
     70         {5,6,7,8}    6.75    7.25    1.07    58.00    50.48


Note that $x$ and $y$ are positively correlated. We can calculate population quantities
since we know the entire population and sampling distribution:

$t_x = 47$            $t_y = 40$
$S_x = 1.3562027$     $S_y = 2.618615$
$R = 0.6838403$       $B = 0.8510638$

Part of the sampling distribution for $\hat{t}_{yr}$ for a sample of size $n = 4$ is given in
Table 4.2; the full file for the possible samples is in file artifratio.dat. Figure 4.3 gives
histograms for the sampling distributions of two estimators of $t_y$: $\hat{t}_{\text{SRS}} = N\bar{y}$, the
estimator used in Chapter 2, and $\hat{t}_{yr}$. The sampling distribution for the ratio estimator
is not spread out as much as the sampling distribution for $N\bar{y}$; it is also skewed rather
than symmetric. The skewness leads to the slight estimation bias of the ratio estimator.
The population total is $t_y = 40$; the mean value of the sampling distribution of $\hat{t}_{yr}$ is
39.85063.

The mean value of the sampling distribution of $\hat{B}$ is 0.8478857, so $\text{Bias}(\hat{B}) =
-0.003178$. The approximate bias of $\hat{B}$, calculated by substituting the population
quantities into (4.6) and noting from (4.2) that $\hat{B} = \hat{\bar{y}}_r/\bar{x}_U$, is

$$\left(1 - \frac{n}{N}\right)\frac{1}{n\bar{x}_U^2}\left(BS_x^2 - RS_xS_y\right) = -0.003126.$$

The variance of the sampling distribution of $\hat{B}$, calculated using the definition of
variance in (2.5), is 0.015186446; the approximation using (4.8) is

$$\left(1 - \frac{4}{8}\right)\frac{1}{(4)(5.875)^2}\left(S_y^2 - 2BRS_xS_y + B^2S_x^2\right) = 0.01468762.$$


Example 4.4 demonstrates that the approximation to the MSE in (4.8) is in fact 
only an approximation; it happens to be a good approximation in that example even 
though the population and sample are both small. 
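
Because the population has only eight units, the full sampling distribution can be enumerated. The sketch below is an illustration in Python, not the file artifratio.dat from the book's website; it assumes the population values x = 4, 5, 5, 6, 8, 7, 7, 5 and y = 1, 2, 4, 4, 7, 7, 7, 8 for units 1 to 8, which reproduce the totals $t_x = 47$ and $t_y = 40$ and the rows of Table 4.2.

```python
from itertools import combinations
from statistics import mean, pvariance

# Population values for units 1..8 (assumed as stated in the lead-in)
x = [4, 5, 5, 6, 8, 7, 7, 5]
y = [1, 2, 4, 4, 7, 7, 7, 8]
N, n = len(x), 4
t_x = sum(x)

t_srs, t_yr = [], []
for s in combinations(range(N), n):      # all 70 possible SRSs of size 4
    xbar = mean(x[i] for i in s)
    ybar = mean(y[i] for i in s)
    t_srs.append(N * ybar)               # SRS estimator N * ybar
    t_yr.append(ybar / xbar * t_x)       # ratio estimator B_hat * t_x

print("number of samples:", len(t_srs))
print("mean of t_SRS:", mean(t_srs), " mean of t_yr:", mean(t_yr))
print("variance of t_SRS:", pvariance(t_srs), " variance of t_yr:", pvariance(t_yr))
```

The mean of the $N\bar{y}$ values equals the population total exactly, while the mean of the ratio estimates falls slightly below it; the ratio estimates are nevertheless less variable.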


FIGURE 4.3

Sampling distributions for (a) $\hat{t}_{\text{SRS}}$ and (b) $\hat{t}_{yr}$. The vertical axis of each histogram is relative frequency.

For (4.7) to be a good approximation to the MSE, the bias should be small and
the terms discarded in the approximation of the variance should be small. If the CV
of $\bar{x}$ is small (that is, if $\bar{x}_U$ is estimated with high relative precision), the bias is small
relative to the square root of the variance. If we form a CI using $\hat{t}_{yr} \pm 1.96\,\text{SE}(\hat{t}_{yr})$,
using (4.11) as the standard error, then the bias will not have a great effect on the
coverage probability of the CI. A small $\text{CV}(\bar{x})$ also means that $\bar{x}$ is stable from sample
to sample. In more complex sampling designs, though, the bias may be a matter of
concern; we return to this issue in Section 4.5 and Chapter 9. For the approximation
in (4.7) to work well, we want the sample size to be large, $\text{CV}(\bar{x}) < 0.1$, and
$\text{CV}(\bar{y}) < 0.1$. If these conditions are not met, then (4.7) may severely underestimate
the true MSE.


4.1.3 Ratio Estimation with Proportions


Ratio estimation works exactly the same way when the quantity of interest is a
proportion.


EXAMPLE 4.5


Peart (1994) collected the data shown in Table 4.3 as part of a study evaluating the 
effects of feral pig activity and drought on the native vegetation on Santa Cruz Island, 
California. She counted the number of woody seedlings in pig-protected areas under 
each of ten sampled oak trees in March 1992, following the drought-ending rains of 
1991. She put a flag by each seedling, then determined how many were still alive in 
February 1994. The data (courtesy of Diann Peart) are plotted in Figure 4.4. 

When most people who have had one introductory statistics course see data like
these, they want to find the sample proportion of the 1992 seedlings that are still
alive in 1994, and then calculate the standard error as though they had an SRS of 206
seedlings, obtaining a value of $\sqrt{(0.2961)(0.7039)/206} = 0.0318$. This calculation
is incorrect for these data since plots, not individual seedlings, are the sampling units.
Seedling survival depends on many factors such as local rainfall, amount of light,
and predation. Such factors are likely to affect seedlings in the same plot to a similar
degree, leading different plots to have, in general, different survival rates. The sample
size in this example is 10, not 206.

TABLE 4.3
Santa Cruz Island Seedling Data

                      x = Number of        y = Seedlings
Tree                  Seedlings, 3/92      Alive, 2/94
1                            1                   0
2                            0                   0
3                            8                   1
4                            2                   2
5                           76                  10
6                           60                  15
7                           25                   3
8                            2                   2
9                            1                   1
10                          31                  27
Total                      206                  61
Average                   20.6                 6.1
Standard deviation     27.4720              8.8248

FIGURE 4.4
Plot of number of seedlings that survived (February 1994) vs. seedlings alive (March 1992),
for ten oak trees. Vertical axis: Seedlings That Survived (February 1994); horizontal axis:
Seedlings Alive (March 1992).

The design is actually a cluster sample; the clusters are the plots associated with
each tree, and the observation units are individual seedlings in those plots. To look at
this example from the framework of ratio estimation, let

$y_i$ = number of seedlings near tree $i$ that are alive in 1994

$x_i$ = number of seedlings near tree $i$ that are alive in 1992.

Then the ratio estimate of the proportion of seedlings still alive in 1994 is

$$\hat{B} = \hat{p} = \frac{\bar{y}}{\bar{x}} = \frac{6.1}{20.6} = 0.2961.$$


Using (4.10) and ignoring the finite population correction (fpc),

$$\text{SE}(\hat{B}) = \sqrt{\frac{1}{(10)(20.6)^2}\;\frac{\sum_{i \in S}(y_i - 0.2961165x_i)^2}{9}} = \sqrt{\frac{56.3778}{(10)(20.6)^2}} = 0.115.$$

The approximation to the variance of $\hat{B}$ in this example may be a little biased
because the sample size is small. Note, however, the difference between the correctly
calculated standard error of 0.115, and the incorrect value 0.0318 that would be
obtained if one erroneously pretended that the seedlings were selected in an SRS.
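
The two standard errors can be compared side by side. The following Python sketch uses the ten (x, y) pairs from Table 4.3 and applies (4.10) without the fpc; it is an illustration with variable names chosen here, not the SAS code from the website.

```python
import math

# Table 4.3: seedlings per tree in March 1992 (x) and survivors in February 1994 (y)
x = [1, 0, 8, 2, 76, 60, 25, 2, 1, 31]
y = [0, 0, 1, 2, 10, 15, 3, 2, 1, 27]
n = len(x)                                           # 10 trees are the sampling units

xbar = sum(x) / n
b_hat = sum(y) / sum(x)                              # estimated survival proportion
resid = [yi - b_hat * xi for xi, yi in zip(x, y)]    # e_i = y_i - B_hat * x_i
s2_e = sum(e * e for e in resid) / (n - 1)           # residuals sum to zero
se_ratio = math.sqrt(s2_e / (n * xbar ** 2))         # (4.10), ignoring the fpc

# Incorrect SE that treats the 206 seedlings as an SRS of seedlings
se_naive = math.sqrt(b_hat * (1 - b_hat) / sum(x))

print(f"B_hat = {b_hat:.4f}, s_e^2 = {s2_e:.4f}")
print(f"SE (ratio) = {se_ratio:.4f}, SE (naive binomial) = {se_naive:.4f}")
```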

SAS code for calculating these estimates is given on the website. The relevant
output from PROC SURVEYMEANS is below:

                              Statistics

                            Std Error
Variable         Mean        of Mean         95% CL for Mean
seed94       6.100000       2.790659     -0.2129093   12.4129093
seed92      20.600000       8.687411      0.9477108   40.2522892

                    Ratio Analysis: seed94/seed92

Numerator   Denominator      Ratio    Std Err      95% CL for Ratio
seed94      seed92        0.296117   0.115262   0.03537532  0.55685769


4.1.4 Ratio Estimation Using Weight Adjustments


In Section 2.4, we defined the sampling weight to be $w_i = 1/\pi_i$, and wrote the
estimated population total as a function of the observations $y_i$ and weights $w_i$:

$$\hat{t}_y = \sum_{i \in S} w_iy_i.$$

Note that

$$\hat{t}_{yr} = \frac{t_x}{\hat{t}_x}\,\hat{t}_y = \frac{t_x}{\hat{t}_x}\sum_{i \in S} w_iy_i.$$

We can think of the modification used in ratio estimation as an adjustment to each
weight. Define

$$g_i = \frac{t_x}{\hat{t}_x}.$$

Then

$$\hat{t}_{yr} = \sum_{i \in S} w_ig_iy_i. \tag{4.12}$$

The estimator $\hat{t}_{yr}$ is a weighted sum of the observations, with weights $w_i^* = w_ig_i$.
Unlike the original weights $w_i$, however, the adjusted weights $w_i^*$ depend upon values
from the sample: If a different sample is taken, the weight adjustment $g_i = t_x/\hat{t}_x$ will
be different.

The weight adjustments $g_i$ calibrate the estimates on the $x$ variable. Since

$$\sum_{i \in S} w_ig_ix_i = t_x,$$

the adjusted weights force the estimated total for the $x$ variable to equal the known
population total $t_x$. The factors $g_i$ are called the calibration factors.

The variance estimators in (4.9) and (4.11) can be calculated by forming the new
variable $u_i = g_ie_i$. Then, for an SRS,

$$\hat{V}(\hat{t}_u) = \left(1 - \frac{n}{N}\right)\frac{N^2}{n}\;\frac{1}{n - 1}\sum_{i \in S} u_i^2 = \left(1 - \frac{n}{N}\right)\left(\frac{t_x}{\bar{x}}\right)^2\frac{s_e^2}{n} = \hat{V}(\hat{t}_{yr})$$

and, similarly, $\hat{V}(\bar{u}) = \hat{V}(\hat{\bar{y}}_r)$.


EXAMPLE 4.6


For the Census of Agriculture data used in Examples 4.2 and 4.3, $g_i = t_x/\hat{t}_x =
964{,}470{,}625/929{,}413{,}560 = 1.037719554$ for each observation. Since the sample has
$\hat{t}_x < t_x$, each observation’s sampling weight is increased by a small amount. The
sampling weight for the SRS design is $w_i = 3078/300 = 10.26$, so the ratio adjusted
weight for each observation is

$$w_i^* = w_ig_i = (10.26)(1.037719554) = 10.64700262.$$

Then

$$\sum_{i \in S} w_ig_ix_i = \sum_{i \in S} 10.64700262\,x_i = 964{,}470{,}625 = t_x$$

and

$$\sum_{i \in S} w_ig_iy_i = \sum_{i \in S} 10.64700262\,y_i = 951{,}513{,}191 = \hat{t}_{yr}.$$

The adjusted weights, however, no longer sum to $N = 3078$:

$$\sum_{i \in S} w_ig_i = (300)(10.64700262) = 3194.$$

Thus, the ratio estimator is calibrated to the population total $t_x$ of the $x$ variable, but
is no longer calibrated to the population size $N$. ■
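
The calibration property is easy to see on a small data set. The following Python sketch uses a hypothetical SRS of n = 4 units from a population of N = 20 with an assumed known total $t_x = 130$; the data and the helper name `ratio_calibrated_weights` are made up for illustration and do not come from the Census of Agriculture example.

```python
def ratio_calibrated_weights(weights, x, t_x):
    """Return adjusted weights w_i * g_i, where g_i = t_x / t_x_hat for every unit."""
    t_x_hat = sum(w * xi for w, xi in zip(weights, x))   # estimated total of x
    g = t_x / t_x_hat                                    # calibration factor
    return [w * g for w in weights]

# Hypothetical sample (not from the book): N = 20, n = 4, known t_x = 130
x = [5.0, 8.0, 6.0, 9.0]
y = [4.0, 9.0, 5.0, 12.0]
w = [20 / 4] * 4                                         # SRS weights N / n
w_star = ratio_calibrated_weights(w, x, t_x=130.0)

t_x_cal = sum(ws * xi for ws, xi in zip(w_star, x))      # reproduces t_x
t_yr = sum(ws * yi for ws, yi in zip(w_star, y))         # ratio estimate of t_y
print(t_x_cal, t_yr, sum(w_star))
```

The adjusted weights reproduce the known x total and give the same value as $(\bar{y}/\bar{x})\,t_x$, but their sum differs from N, which mirrors the behavior described in Example 4.6.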


4.1.5 Advantages of Ratio Estimation


Ratio estimation is motivated by the desire to use information about a known auxiliary
quantity $x$ to obtain a more accurate estimator of $t_y$ or $\bar{y}_U$. If $x$ and $y$ are perfectly
correlated, that is, $y_i = Bx_i$ for all $i = 1, \ldots, N$, then $\hat{t}_{yr} = t_y$ and there is no estimation
error. In general, if $y_i$ is roughly proportional to $x_i$, the MSE will be small.

When does ratio estimation help? If the deviations of $y_i$ from $\hat{B}x_i$ are smaller than
the deviations of $y_i$ from $\bar{y}$, then $\hat{V}(\hat{\bar{y}}_r) < \hat{V}(\bar{y})$. Recall from Chapter 2 that

$$\text{MSE}(\bar{y}) = V(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n}.$$

Using the approximation in (4.7) and (4.8),

$$\text{MSE}(\hat{\bar{y}}_r) \approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(S_y^2 - 2BRS_xS_y + B^2S_x^2\right).$$

Thus,

$$\begin{aligned}
\text{MSE}(\hat{\bar{y}}_r) - \text{MSE}(\bar{y}) &\approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(S_y^2 - 2BRS_xS_y + B^2S_x^2 - S_y^2\right) \\
&= \left(1 - \frac{n}{N}\right)\frac{1}{n}\,S_xB\left(-2RS_y + BS_x\right),
\end{aligned}$$

so to the accuracy of the approximation,

$$\text{MSE}(\hat{\bar{y}}_r) \le \text{MSE}(\bar{y}) \quad \text{if and only if} \quad R \ge \frac{BS_x}{2S_y} = \frac{\text{CV}(x)}{2\,\text{CV}(y)}.$$

If the CVs are approximately equal, then it pays to use ratio estimation when the
correlation between $x$ and $y$ is larger than 1/2.

Ratio estimation is most appropriate if a straight line through the origin summarizes
the relationship between $x_i$ and $y_i$ and if the variance of $y_i$ about the line is
proportional to $x_i$. Under these conditions, $\hat{B}$ is the weighted least squares regression
slope for the line through the origin with weights proportional to $1/x_i$: the slope $\hat{B}$
minimizes the sum of squares

$$\sum_{i \in S}\frac{1}{x_i}\left(y_i - \hat{B}x_i\right)^2.$$
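
Both statements in this subsection can be checked numerically. The Python sketch below evaluates the threshold $\text{CV}(x)/(2\,\text{CV}(y))$ with the population quantities reported for Example 4.4, and then confirms on the same eight (x, y) pairs (assumed values, as read from that example) that the weighted least squares slope with weights $1/x_i$ equals the ratio of the totals. The function names are illustrative.

```python
def ratio_beats_srs(R, cv_x, cv_y):
    """Approximate condition for MSE(ratio) <= MSE(SRS mean): R >= CV(x) / (2 CV(y))."""
    return R >= cv_x / (2 * cv_y)

# Population quantities reported in Example 4.4
S_x, S_y, R = 1.3562027, 2.618615, 0.6838403
xbar_U, ybar_U = 47 / 8, 40 / 8
cv_x, cv_y = S_x / xbar_U, S_y / ybar_U
threshold = cv_x / (2 * cv_y)
print(f"threshold = {threshold:.4f}, R = {R:.4f}, ratio helps: {ratio_beats_srs(R, cv_x, cv_y)}")

def wls_slope_through_origin(x, y):
    """Weighted least squares slope for y = B x with weights 1 / x_i (requires x_i > 0)."""
    num = sum((1 / xi) * xi * yi for xi, yi in zip(x, y))
    den = sum((1 / xi) * xi * xi for xi in x)
    return num / den

# Assumed population values of Example 4.4, treated here as one data set
x = [4, 5, 5, 6, 8, 7, 7, 5]
y = [1, 2, 4, 4, 7, 7, 7, 8]
slope = wls_slope_through_origin(x, y)
print(slope, sum(y) / sum(x))
```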


4.2 Estimation in Domains


Often we want separate estimates for subpopulations; the subpopulations are called
domains or subdomains. We may want to take an SRS of 1000 people from a population
of 50,000 people and estimate the average salary for men and the average salary
for women. There are two domains: men and women. We do not know which persons
in the population belong to which domain until they are sampled, though. Thus, the
number of persons in an SRS who fall into each domain is a random variable, with
value unknown at the time the survey is designed.

Estimating domain means is a special case of ratio estimation. Suppose there are
$D$ domains. Let $\mathcal{U}_d$ be the index set of the units in the population that are in domain
$d$, and let $S_d$ be the index set of the units in the sample that are in domain $d$, for
$d = 1, 2, \ldots, D$. Let $N_d$ be the number of population units in $\mathcal{U}_d$, and $n_d$ be the
number of sample units in $S_d$. Suppose we want to estimate the mean salary for the
domain of women,

$$\bar{y}_{U_d} = \frac{\sum_{i \in \mathcal{U}_d} y_i}{N_d} = \frac{\text{total salary for all women in population}}{\text{number of women in population}}.$$

A natural estimator of $\bar{y}_{U_d}$ is

$$\bar{y}_d = \frac{\sum_{i \in S_d} y_i}{n_d} = \frac{\text{total salary for women in sample}}{\text{number of women in sample}},$$

which looks at first just like the sample means studied in Chapter 2.

The quantity $n_d$ is a random variable, however: If a different SRS is taken, we will
very likely have a different value for $n_d$, the number of women in the sample. To see
that $\bar{y}_d$ uses ratio estimation, let

$$x_i = \begin{cases} 1 & \text{if } i \in \mathcal{U}_d \\ 0 & \text{if } i \notin \mathcal{U}_d, \end{cases}
\qquad
u_i = y_ix_i = \begin{cases} y_i & \text{if } i \in \mathcal{U}_d \\ 0 & \text{if } i \notin \mathcal{U}_d. \end{cases}$$

Then $t_x = \sum_{i=1}^N x_i = N_d$, $\bar{x}_U = N_d/N$, $t_u = \sum_{i=1}^N u_i$, $\bar{y}_{U_d} = t_u/t_x = B$, $\bar{x} = n_d/n$, and

$$\bar{y}_d = \hat{B} = \frac{\bar{u}}{\bar{x}} = \frac{\hat{t}_u}{\hat{t}_x}.$$
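Because we are estimating a ratio, we use (4.10) to calculate the standard error:

$$\begin{aligned}
\text{SE}(\bar{y}_d) &= \sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n\bar{x}^2}\;\frac{\sum_{i \in S}(u_i - \hat{B}x_i)^2}{n - 1}} \\
&= \sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n\bar{x}^2}\;\frac{\sum_{i \in S_d}(y_i - \bar{y}_d)^2}{n - 1}} \\
&= \sqrt{\left(1 - \frac{n}{N}\right)\frac{n}{n_d^2}\;\frac{(n_d - 1)s_{yd}^2}{n - 1}},
\end{aligned} \tag{4.13}$$

where

$$s_{yd}^2 = \frac{\sum_{i \in S_d}(y_i - \bar{y}_d)^2}{n_d - 1}$$

is the sample variance of the sample observations in domain $d$. If $E(n_d)$ is large, then
$(n_d - 1)/n_d \approx 1$ and

$$\text{SE}(\bar{y}_d) \approx \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_{yd}^2}{n_d}}.$$

In a sufficiently large sample, the standard error of $\bar{y}_d$ is approximately the same as if
we used formula (2.12).

The situation is a little more complicated when estimating a domain total. If $N_d$
is known, we can estimate $t_u$ by $N_d\bar{y}_d$. If $N_d$ is unknown, though, we need to estimate
$t_u$ by

$$\hat{t}_{yd} = \hat{t}_u = N\bar{u}.$$

The standard error is

$$\text{SE}(\hat{t}_{yd}) = N\,\text{SE}(\bar{u}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_u^2}{n}}.$$


EXAMPLE 4.7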


In the SRS of size 300 from the Census of Agriculture (see Example 2.5), 39 counties
are in western states.² What is the estimated total number of acres devoted to farming
in the West?

The sample mean of the 39 counties is $\bar{y}_d = 598{,}681$, with sample standard
deviation $s_{yd} = 516{,}157.7$. Using (4.13),

$$\text{SE}(\bar{y}_d) = \sqrt{\left(1 - \frac{300}{3078}\right)\frac{300}{39}\;\frac{38}{299}}\;\frac{516{,}157.7}{\sqrt{39}} = 77{,}637.$$

Thus, $\text{CV}[\bar{y}_d] = 0.1297$, and an approximate 95% CI for the mean farm acreage for
counties in the western United States is [445,897, 751,463].

For estimating the total number of acres devoted to farming in the West, suppose
we do not know how many counties in the population are in the western United States.
Define

$$x_i = \begin{cases} 1, & \text{if county } i \text{ is in the western United States} \\ 0, & \text{otherwise} \end{cases}$$

and define $u_i = y_ix_i$. Then

$$\hat{t}_{yd} = \hat{t}_u = \sum_{i \in S}\frac{3078}{300}\,u_i = 239{,}556{,}051. \tag{4.14}$$

The standard error is

$$\text{SE}(\hat{t}_{yd}) = 3078\sqrt{1 - \frac{300}{3078}}\;\frac{273{,}005.4}{\sqrt{300}} = 46{,}090{,}460.$$


The estimated CV for $\hat{t}_{yd}$ is $\text{CV}[\hat{t}_{yd}] = 46{,}090{,}460/239{,}556{,}051 = 0.1924$; had we
known the number of counties in the western United States and been able to use that
value in the estimate, the CV for the estimated total would have been 0.1297, the CV
for the mean.

²Alaska (AK), Arizona (AZ), California (CA), Colorado (CO), Hawaii (HI), Idaho (ID), Montana (MT),
Nevada (NV), New Mexico (NM), Oregon (OR), Utah (UT), Washington (WA), and Wyoming (WY).
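
The two standard errors in this example follow directly from the summary statistics. The Python sketch below is an illustration of (4.13) and of the domain total formula; the function names are assumptions made here and are not part of the SAS program on the website.

```python
import math

def domain_mean_se(N, n, n_d, s_yd):
    """SE of a domain mean under SRS, following (4.13)."""
    return math.sqrt((1 - n / N) * (n / n_d**2) * (n_d - 1) / (n - 1)) * s_yd

def domain_total_se(N, n, s_u):
    """SE of N * ubar, where u_i = y_i inside the domain and 0 outside it."""
    return N * math.sqrt(1 - n / N) * s_u / math.sqrt(n)

# Summary statistics quoted in Example 4.7
se_mean = domain_mean_se(N=3078, n=300, n_d=39, s_yd=516_157.7)
se_total = domain_total_se(N=3078, n=300, s_u=273_005.4)

cv_mean = se_mean / 598_681
cv_total = se_total / 239_556_051
print(f"SE(mean) = {se_mean:,.0f}, CV = {cv_mean:.4f}")
print(f"SE(total) = {se_total:,.0f}, CV = {cv_total:.4f}")
```

The larger CV for the total reflects the extra uncertainty from not knowing $N_d$: the number of western counties in the sample is itself random.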


The SAS program on the website also contains the code for finding domain estimates.
We define the domain indicator west to be 1 if the county is in the West and 0
otherwise. The relevant output is

                      Domain Analysis: west

                                    Std Error
west   Variable         Mean         of Mean          95% CL for Mean
0      acres92        252952           16834     219825.176   286079.583
1      acres92        598681           77637     445897.252   751463.927

west   Variable          Sum         Std Dev          95% CL for Sum
0      acres92     677371058        47317687      584253179    770488938
1      acres92     239556051        46090457      148853274    330258829

The output gives the estimates and CIs for both domains. ■


EXAMPLE 4.8


An SRS of 1500 licensed boat owners in a state was sampled from a list of 400,000 
names with currently licensed boats; 472 of the respondents said they owned an open 
motorboat longer than 16 feet. The 472 respondents with large motorboats reported 
having the following numbers of children: 


Number of Number of 
Children Respondents 
0 76 
1 139 
2 166 
3 63 
4 19 
5 5 
6 3 
8 1 
Total 472 


If we are interested in characteristics of persons who own large motorboats, there
are two domains: persons who own large motorboats (domain 1) and persons who do
not own large motorboats (domain 2). To estimate the percentage of large-motorboat
owners who have children, we can use $\hat{p}_1 = 396/472 = 0.839$. This is a ratio
estimator, but in this case, as shown in (4.13), the standard error is approximately
what you would think it would be. Ignoring the fpc,

$$\text{SE}(\hat{p}_1) = \sqrt{\frac{0.839(1 - 0.839)}{472}} = 0.017.$$

To look at the average number of children per household among registered boat owners
who register a motorboat more than 16 feet long, note that the average number of
children for the 472 respondents in the domain is 1.667373, with variance 1.398678.
Thus an approximate 95% CI for the average number of children in large-motorboat
households is

$$1.667 \pm 1.96\sqrt{\frac{1.398678}{472}} = [1.56, 1.77].$$


To estimate the total number of children in the state whose parents register a large 
motorboat, we create a new variable u for the respondents that equals the number of 
children if the respondent has a motorboat, and 0 otherwise. The frequency distribution 
for the variable u is then 


Number of Number of 
Children Respondents 
0 1104 
1 139 
2 166 
3 63 
4 19 
5 5 
6 3 
8 1 
Total 1500 


Now $\bar{u} = 0.524666$ and $s_u^2 = 1.0394178$, so $\hat{t}_{yd} = 400{,}000(0.524666) = 209{,}867$
and

$$\text{SE}(\hat{t}_{yd}) = \text{SE}(\hat{t}_u) = \sqrt{\left(1 - \frac{1500}{400{,}000}\right)(400{,}000)^2\;\frac{1.0394178}{1500}} = 10{,}510.$$

The variable $u_i$ counts the number of children in household $i$ that belong to a household
with a large open motorboat. SAS code to find estimates for this example is given on
the website. ■
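
All of the quantities in this example can be computed from the frequency table alone. The following Python sketch is an illustration (not the SAS code from the website); it builds the sums for the variable $u$ from the table of children counts, with the 1028 respondents outside the domain contributing zeros.

```python
import math

N, n = 400_000, 1500
# number of children -> number of large-motorboat respondents
freq = {0: 76, 1: 139, 2: 166, 3: 63, 4: 19, 5: 5, 6: 3, 8: 1}
n_d = sum(freq.values())                       # respondents in the domain

# u_i = number of children for domain members, 0 for everyone else
sum_u = sum(k * c for k, c in freq.items())
sum_u2 = sum(k * k * c for k, c in freq.items())
ubar = sum_u / n
s2_u = (sum_u2 - n * ubar**2) / (n - 1)

t_hat = N * ubar                               # estimated domain total
se_total = N * math.sqrt((1 - n / N) * s2_u / n)

# Domain mean and variance among the n_d large-motorboat owners
ybar_d = sum_u / n_d
s2_yd = (sum_u2 - n_d * ybar_d**2) / (n_d - 1)

print(f"ubar = {ubar:.6f}, s_u^2 = {s2_u:.7f}")
print(f"estimated total = {t_hat:,.0f}, SE = {se_total:,.0f}")
print(f"domain mean = {ybar_d:.6f}, domain variance = {s2_yd:.6f}")
```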


In this section, we have shown that estimating domain means is a special case of
ratio estimation because the sample size in the domain varies from sample to sample.
If the sample size for the domain in an SRS is sufficiently large, we can use SRS
formulas for inference about the domain mean.

Inference about totals depends on whether the population size of the domain,
$N_d$, is known. If $N_d$ is known, then the estimated total is $N_d\bar{y}_d$. If $N_d$ is unknown,
then define a new variable $u_i$ that equals $y_i$ for observations in the domain and 0 for
observations not in the domain; then use $\hat{t}_u$ to estimate the domain total.




The results of this section are only for SRSs, and the approximations depend on
having a sufficiently large sample so that $E(n_d)$ is large. In Section 14.2, we discuss
estimating domain means and totals if the data are collected using other sampling
designs, or when the domain sample sizes are small.


4.3 Regression Estimation in Simple Random Sampling


4.3.1 Using a Straight-Line Model


Ratio estimation works best if the data are well fit by a straight line through the origin.
Sometimes, data appear to be evenly scattered about a straight line that does not go
through the origin; that is, the data look as though the usual straight-line regression
model

$$y = B_0 + B_1x$$

would provide a good fit.

Let $\hat{B}_1$ and $\hat{B}_0$ be the ordinary least squares regression coefficients of the slope
and intercept. For the straight line regression model,

$$\hat{B}_1 = \frac{\sum_{i \in S}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S}(x_i - \bar{x})^2} = \frac{rs_y}{s_x},$$

$$\hat{B}_0 = \bar{y} - \hat{B}_1\bar{x},$$

and $r$ is the sample correlation coefficient of $x$ and $y$.

In regression estimation, like ratio estimation, we use the correlation between $x$
and $y$ to obtain an estimator for $\bar{y}_U$ with (we hope) increased precision. Suppose we
know $\bar{x}_U$, the population mean for the $x$’s. Then the regression estimator of $\bar{y}_U$ is the
predicted value of $y$ from the fitted regression equation when $x = \bar{x}_U$:

$$\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1\bar{x}_U = \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x}). \tag{4.15}$$

If $\bar{x}$ from the sample is smaller than the population mean $\bar{x}_U$ and $x$ and $y$ are positively
correlated, then we would expect $\bar{y}$ to also be smaller than $\bar{y}_U$. The regression estimator
adjusts $\bar{y}$ by the quantity $\hat{B}_1(\bar{x}_U - \bar{x})$.

Like the ratio estimator, the regression estimator is biased. Let $B_1$ be the least
squares regression slope calculated from all the data in the population,

$$B_1 = \frac{\sum_{i=1}^N(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{\sum_{i=1}^N(x_i - \bar{x}_U)^2} = \frac{RS_y}{S_x}.$$

Then, using (4.15), the bias of $\hat{\bar{y}}_{\text{reg}}$ is given by

$$E[\hat{\bar{y}}_{\text{reg}} - \bar{y}_U] = E[\bar{y} - \bar{y}_U] + E[\hat{B}_1(\bar{x}_U - \bar{x})] = -\text{Cov}(\hat{B}_1, \bar{x}). \tag{4.16}$$

If the regression line goes through all of the points $(x_i, y_i)$ in the population, then the
bias is zero: in that situation, $\hat{B}_1 = B_1$ for every sample, so $\text{Cov}(\hat{B}_1, \bar{x}) = 0$. As with
ratio estimation, for large SRSs the MSE for regression estimation is approximately
equal to the variance (see Exercise 29); the bias is often negligible in large samples.

The method used in approximating the MSE in ratio estimation can also be applied
to regression estimation. Let $d_i = y_i - [\bar{y}_U + B_1(x_i - \bar{x}_U)]$. Then,

$$\text{MSE}(\hat{\bar{y}}_{\text{reg}}) = E\left[\{\bar{y} + \hat{B}_1(\bar{x}_U - \bar{x}) - \bar{y}_U\}^2\right] \approx V(\bar{d}) = \left(1 - \frac{n}{N}\right)\frac{S_d^2}{n}. \tag{4.17}$$

Using the relation $B_1 = RS_y/S_x$, it may be shown that

$$\left(1 - \frac{n}{N}\right)\frac{S_d^2}{n} = \left(1 - \frac{n}{N}\right)\frac{1}{n}\sum_{i=1}^N\frac{(y_i - \bar{y}_U - B_1[x_i - \bar{x}_U])^2}{N - 1} = \left(1 - \frac{n}{N}\right)\frac{1}{n}\,S_y^2(1 - R^2). \tag{4.18}$$

(See Exercise 28.) Thus, the approximate MSE is small when

■ $n$ is large
■ $n/N$ is large
■ $S_y$ is small
■ the correlation $R$ is close to −1 or +1.

The standard error may be calculated by substituting estimates for the population
quantities in (4.17) or (4.18). We can estimate $S_d^2$ in (4.17) by using the residuals
$e_i = y_i - (\hat{B}_0 + \hat{B}_1x_i)$; then $s_e^2 = \sum_{i \in S} e_i^2/(n - 1)$ estimates $S_d^2$ and

$$\text{SE}(\hat{\bar{y}}_{\text{reg}}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}. \tag{4.19}$$

In small samples, we may alternatively calculate $s_e^2$ using the MSE from a regression
analysis: $s_e^2 = \sum_{i \in S} e_i^2/(n - 2)$. This adjusts the estimator for the degrees of freedom
in the regression. To estimate the variance using the formulation in (4.18), substitute
the sample variance $s_y^2$ and the sample correlation $r$ for the population quantities $S_y^2$
and $R$, obtaining

$$\text{SE}(\hat{\bar{y}}_{\text{reg}}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n}\,s_y^2(1 - r^2)}. \tag{4.20}$$


EXAMPLE 4.9


To estimate the number of dead trees in an area, we divide the area into 100 square
plots and count the number of dead trees on a photograph of each plot. Photo counts
can be made quickly, but sometimes a tree is misclassified or not detected. So we
select an SRS of 25 of the plots for field counts of dead trees. We know that the
population mean number of dead trees per plot from the photo count is 11.3. The
data, plotted in Figure 4.5, are given below.

Photo (x)   10  12   7  13  13   6  17  16  15  10  14  12  10
Field (y)   15  14   9  14   8   5  18  15  13  15  11  15  12

Photo (x)    5  12  10  10   9   6  11   7   9  11  10  10
Field (y)    8  13   9  11  12   9  12  13  11  10   9   8

FIGURE 4.5

The plot of photo and field tree-count data, along with the regression line. Note that $\hat{\bar{y}}_{\text{reg}}$ is the
predicted value from the regression equation when $x = \bar{x}_U$. The point $(\bar{x}, \bar{y})$ is marked by “+”
on the graph. Horizontal axis: Photo Count of Dead Trees; vertical axis: Field Count of Dead Trees.


For these data, $\bar{x} = 10.6$, $\bar{y} = 11.56$, $s_y^2 = 9.09$, and the sample correlation
between $x$ and $y$ is $r = 0.62420$. Fitting a straight line regression model gives

$$\hat{y} = 5.059292 + 0.613274x$$

with $\hat{B}_0 = 5.059292$ and $\hat{B}_1 = 0.613274$. In this example, $x$ and $y$ are positively
correlated so that $\bar{x}$ and $\bar{y}$ are also positively correlated. Since $\bar{x} < \bar{x}_U$, we expect
that the sample mean $\bar{y}$ is also too small; the regression estimate adds the quantity
$\hat{B}_1(\bar{x}_U - \bar{x}) = 0.613(11.3 - 10.6) = 0.43$ to $\bar{y}$ to compensate.

Using (4.15), the regression estimate of the mean is

$$\hat{\bar{y}}_{\text{reg}} = 5.059292 + 0.613274(11.3) = 11.99.$$


From (4.20), the standard error is

$$\text{SE}(\hat{\bar{y}}_{\text{reg}}) = \sqrt{\frac{1}{25}\left(1 - \frac{25}{100}\right)(9.09)(1 - 0.62420^2)} = 0.408.$$

The standard error of $\hat{\bar{y}}_{\text{reg}}$ is less than that for $\bar{y}$:

$$\text{SE}[\bar{y}] = \sqrt{\left(1 - \frac{25}{100}\right)\frac{s_y^2}{25}} = 0.522.$$

We expect regression estimation to increase the precision in this example because the
variables photo and field are positively correlated. To estimate the total number of
dead trees, use

$$\hat{t}_{y\text{reg}} = (100)(11.99) = 1199; \qquad \text{SE}(\hat{t}_{y\text{reg}}) = (100)(0.408) = 40.8.$$

In SAS software, PROC SURVEYREG calculates regression estimates. Code for
this example is given on the website; partial output is given below.

                      Analysis of Estimable Functions

                                   Standard                             95% Confidence
Parameter            Estimate         Error   t Value   Pr > |t|           Interval
Total field trees  1198.92920    42.7013825     28.08     <.0001   1110.79788  1287.06053
Mean field trees     11.98929     0.4270138     28.08     <.0001     11.10798    12.87061

The standard errors given by SAS software are slightly larger than those obtained by
hand calculation because SAS uses a slightly different estimator (see Section 11.7). In
practice, we recommend using survey regression software for regression estimation
to avoid roundoff errors. ■
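
The hand calculations in this example can be reproduced from the 25 data pairs. The Python sketch below is an illustration of (4.15) and (4.20) using only the standard library; it is not the PROC SURVEYREG code from the website, and it uses the hand-calculation variance estimator rather than the one SAS applies.

```python
import math

# Photo (x) and field (y) counts for the 25 sampled plots
photo = [10, 12, 7, 13, 13, 6, 17, 16, 15, 10, 14, 12, 10,
         5, 12, 10, 10, 9, 6, 11, 7, 9, 11, 10, 10]
field = [15, 14, 9, 14, 8, 5, 18, 15, 13, 15, 11, 15, 12,
         8, 13, 9, 11, 12, 9, 12, 13, 11, 10, 9, 8]
N, n, xbar_U = 100, len(photo), 11.3

xbar = sum(photo) / n
ybar = sum(field) / n
sxx = sum((x - xbar) ** 2 for x in photo)
syy = sum((y - ybar) ** 2 for y in field)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(photo, field))

b1 = sxy / sxx                                  # least squares slope
b0 = ybar - b1 * xbar                           # least squares intercept
r = sxy / math.sqrt(sxx * syy)                  # sample correlation
s2_y = syy / (n - 1)                            # sample variance of y

ybar_reg = b0 + b1 * xbar_U                                 # (4.15)
se_reg = math.sqrt((1 - n / N) * s2_y * (1 - r**2) / n)     # (4.20)
se_srs = math.sqrt((1 - n / N) * s2_y / n)                  # SE of ybar alone

print(f"B0 = {b0:.6f}, B1 = {b1:.6f}, r = {r:.5f}")
print(f"regression estimate = {ybar_reg:.5f}, SE = {se_reg:.3f}")
print(f"SRS estimate = {ybar:.2f}, SE = {se_srs:.3f}")
print(f"estimated total = {N * ybar_reg:.1f}, SE = {N * se_reg:.1f}")
```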


4.3.2 Difference Estimation


Difference estimation is a special case of regression estimation, used when the investigator
“knows” that the slope $B_1$ is 1. Difference estimation is often recommended
in accounting when an SRS is taken. A list of accounts receivable consists of the 
book value for each account—the company’s listing of how much is owed on each 
account. In the simplest sampling scheme, the auditor scrutinizes a random sample 
of the accounts to determine the audited value—the actual amount owed—in order 
to estimate the error in the total accounts receivable. The quantities considered are: 


y; = audited value for company i 


x; = book value for company i. 


Then, $\bar{y} - \bar{x}$ is the mean difference for the audited accounts.


The estimated total difference is $\hat{t}_y - \hat{t}_x = N(\bar{y} - \bar{x})$; the estimated audited value
for accounts receivable is

$$\hat{t}_{y\text{diff}} = t_x + (\hat{t}_y - \hat{t}_x).$$

The residuals from this model are $e_i = y_i - x_i$. The variance of $\hat{t}_{y\text{diff}}$ is

$$V(\hat{t}_{y\text{diff}}) = V[t_x + (\hat{t}_y - \hat{t}_x)] = V(\hat{t}_e),$$

where $\hat{t}_e = (N/n)\sum_{i \in S} e_i$. If the variability in the residuals $e_i$ is smaller than the
variability among the $y_i$, then difference estimation will increase precision.

Difference estimation works best if the population and sample have a large fraction 
of nonzero differences that are roughly equally divided between overstatements and 
understatements, and if the sample is large enough so that the sampling distribution 
of (y — x) is approximately normal. In auditing, it is possible that most of the audited 
values in the sample are all the same as the corresponding book value. In such a 
situation, the design-based variance estimator is unstable and a model-based approach 
may be preferred. 
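
As a small illustration of the difference estimator, the following Python sketch uses a hypothetical audit of n = 5 accounts drawn from N = 50 accounts with an assumed book total of 12,000. The data and the function name `difference_estimate` are invented for this sketch; the standard error treats the residuals as an SRS, as in the variance expression above.

```python
import math

def difference_estimate(N, t_x, x, y):
    """Difference estimator of the audited total and its SE under SRS.

    x: book values in the sample, y: audited values, t_x: known book total.
    """
    n = len(x)
    e = [yi - xi for xi, yi in zip(x, y)]           # residuals e_i = y_i - x_i
    ebar = sum(e) / n
    s2_e = sum((ei - ebar) ** 2 for ei in e) / (n - 1)
    t_ydiff = t_x + N * ebar                        # t_x + (t_y_hat - t_x_hat)
    se = N * math.sqrt((1 - n / N) * s2_e / n)      # SE of t_e_hat = N * ebar
    return t_ydiff, se

# Hypothetical audit data (not from the book)
book = [200.0, 310.0, 150.0, 420.0, 275.0]
audit = [200.0, 300.0, 155.0, 420.0, 270.0]
t_ydiff, se = difference_estimate(N=50, t_x=12_000.0, x=book, y=audit)
print(f"estimated audited total = {t_ydiff:.1f}, SE = {se:.2f}")
```

With so few nonzero differences in a sample this small, the variance estimate would be unstable in practice, which is the caution raised in the paragraph above.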


4.4 Poststratification


Suppose a sampling frame lists all households in an area, and you would like to 
estimate the average amount spent on food in a month. One desirable stratification 
variable might be household size because large households might be expected to have 
higher food bills than smaller households. From U.S. census data, the distribution of 
household size in the region is known: 


Number of Persons Percentage 
in Household of Households 
1 25.75 
2 31.17 
3 17.50 
4 15.58 
5 or more 10.00 


The sampling frame, however, does not include information on household size— 
it only lists the households. Thus, although you know the population size in each 
subgroup, you cannot take a stratified sample because you do not know the stratum 
membership of the units in your sampling frame. You can, however, take an SRS and 
record the amount spent on food as well as the household size for each household in 
your sample. If n, the size of the SRS, is large enough, then the sample is likely to 
resemble a stratified sample with proportional allocation: We would expect about 26% 
of the sample to be one-person households, about 31% to be two-person households, 
and so on. 

Considering the different household-size groups to be different domains, we can 
use the methods from Section 4.2 to estimate the average amount spent on groceries 
for each domain. Take an SRS of size $n$. Let $n_1, n_2, \ldots, n_H$ be the numbers of units in
the various household-size groups (domains) and let $\bar{y}_1, \ldots, \bar{y}_H$ be the sample means
for the groups. In this case, since the poststrata are formed after the sample is taken,
the sample domain sizes $n_1, n_2, \ldots, n_H$ are random quantities. If we selected another
SRS from the population, the poststratum sizes in the sample would change. Since
the poststratum sizes in the population are known, however, we can use the known
values of $N_h$ in the estimation.

To see how poststratification fits in the framework of ratio estimation, define
$x_{ih} = 1$ if observation $i$ is in poststratum $h$ and 0 otherwise. Let $u_{ih} = y_i x_{ih}$. Then
$t_{xh} = \sum_{i=1}^{N} x_{ih} = N_h$ and

\[ t_{uh} = \sum_{i=1}^{N} u_{ih} = \text{population total of variable } y \text{ in poststratum } h. \]

For each poststratum $h$, we can estimate the total in the poststratum by

\[ \hat{t}_{uh} = \sum_{i \in S} \frac{N}{n} u_{ih} \]

[$\hat{t}_{uh}$ is the domain total estimator in (4.14)]. We can then use ratio estimation to obtain:

\[ \hat{t}_{uhr} = \frac{t_{xh}}{\hat{t}_{xh}} \hat{t}_{uh} = \frac{N_h}{\hat{N}_h} \hat{t}_{uh} = N_h \bar{y}_h, \]

where $\hat{t}_{xh} = \hat{N}_h = (N/n) n_h$ estimates the poststratum size and $\bar{y}_h$ is the sample mean
of the observations in poststratum $h$.
The poststratified estimator of the population total is

\[ \hat{t}_{y\text{post}} = \sum_{h=1}^{H} \hat{t}_{uhr} = \sum_{h=1}^{H} \frac{N_h}{\hat{N}_h} \hat{t}_{uh} = \sum_{h=1}^{H} N_h \bar{y}_h; \]

ratio estimation is used within each poststratum to estimate the population total in
that poststratum.
The poststratified estimator of $\bar{y}_U$ is

\[ \bar{y}_{\text{post}} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h. \qquad (4.21) \]


If $N_h/N$ is known, $n_h$ is reasonably large ($> 30$ or so), and $n$ is large, then we can
use the variance for proportional allocation as an approximation to the poststratified
variance:

\[ \hat{V}(\bar{y}_{\text{post}}) \approx \left(1 - \frac{n}{N}\right) \sum_{h=1}^{H} \frac{N_h}{N} \frac{s_h^2}{n}. \qquad (4.22) \]


This approximation is valid only when the expected sample sizes in each poststratum 
are large, however (see Exercise 37). 
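
As a rough illustration of (4.21) and (4.22), the sketch below computes the poststratified mean and the approximate variance from an SRS in which each unit's poststratum is recorded. The function name, the two-group data, and the population shares are hypothetical; the tiny groups are for arithmetic clarity only and are far below the sizes for which (4.22) is a reasonable approximation.

```python
from collections import defaultdict

def poststratified_mean(values, groups, pop_shares, sampling_fraction=0.0):
    """Return (ybar_post, approximate variance) following (4.21) and (4.22).

    values            : sampled y values
    groups            : poststratum label of each sampled unit
    pop_shares        : dict mapping label -> known N_h / N
    sampling_fraction : n / N (0.0 ignores the fpc)
    """
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    n = len(values)
    ybar_post, v_hat = 0.0, 0.0
    for g, share in pop_shares.items():
        obs = by_group[g]                        # needs at least 2 sampled units
        mean_h = sum(obs) / len(obs)
        s2_h = sum((v - mean_h) ** 2 for v in obs) / (len(obs) - 1)
        ybar_post += share * mean_h              # (N_h / N) * ybar_h
        v_hat += share * s2_h / n                # (N_h / N) * s_h^2 / n
    return ybar_post, (1 - sampling_fraction) * v_hat

# Hypothetical monthly food spending for four households in two size groups.
spend = [200.0, 220.0, 300.0, 340.0]
size_group = ["small", "small", "large", "large"]
ybar_post, v_post = poststratified_mean(spend, size_group, {"small": 0.4, "large": 0.6})
print(ybar_post, v_post)
```

The sample here is split evenly between the groups, but the known shares (0.4 and 0.6) reweight the group means, which is the adjustment poststratification provides.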

Many large surveys use poststratification to improve efficiency of the estimators or 
to correct for the effects of differential nonresponse in the poststrata (see Chapter 8). 
We discuss poststratification for general survey designs in Section 11.7. 
4.5 Ratio Estimation with Stratified Samples*


The previous sections proposed ratio and regression estimators for use with SRSs. 
The concept of ratio estimation, however, is completely general and is easily extended 
to other sampling designs. In stratified sampling, for example, we can use estimators 
of the population totals for x and y from a stratified sample to give the combined 
ratio estimator 


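\[ \hat{t}_{yrc} = \hat{B} t_x, \]

where

\[ \hat{B} = \frac{\hat{t}_{y,\text{str}}}{\hat{t}_{x,\text{str}}}. \]

As in Sections 3.2 and 3.3,

\[ \hat{t}_{y,\text{str}} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}, \]

where the sampling weight is $w_{hj} = (N_h/n_h)$, and

\[ \hat{t}_{x,\text{str}} = \sum_{h=1}^{H} N_h \bar{x}_h = \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} x_{hj}. \]

Then, using the arguments in Section 4.1.2,

\[ \text{MSE}(\hat{t}_{yrc}) \approx V(\hat{t}_{y,\text{str}} - B \hat{t}_{x,\text{str}}) = V\left[ \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} (y_{hj} - B x_{hj}) \right]; \]

we estimate the MSE by

\[ \widehat{\text{MSE}}(\hat{t}_{yrc}) = \left( \frac{t_x}{\hat{t}_{x,\text{str}}} \right)^2 \hat{V}\left[ \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} e_{hj} \right] \]
\[ = \left( \frac{t_x}{\hat{t}_{x,\text{str}}} \right)^2 \hat{V}(\hat{t}_{e,\text{str}}) \]
\[ = \left( \frac{t_x}{\hat{t}_{x,\text{str}}} \right)^2 \left[ \hat{V}(\hat{t}_{y,\text{str}}) + \hat{B}^2 \hat{V}(\hat{t}_{x,\text{str}}) - 2 \hat{B} \widehat{\text{Cov}}(\hat{t}_{y,\text{str}}, \hat{t}_{x,\text{str}}) \right], \]

where $e_{hj} = y_{hj} - \hat{B} x_{hj}$. In the combined ratio estimator, first the strata are combined
to estimate $t_x$ and $t_y$, then ratio estimation is applied.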

For the separate ratio estimator, ratio estimation is applied first, then the strata
are combined. The estimator

\[ \hat{t}_{yrs} = \sum_{h=1}^{H} \hat{t}_{yhr} = \sum_{h=1}^{H} t_{xh} \frac{\hat{t}_{yh}}{\hat{t}_{xh}} \]

uses ratio estimation separately in each stratum, with

\[ \hat{V}(\hat{t}_{yrs}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_{yhr}). \]

It can improve efficiency if the $\hat{t}_{yh}/\hat{t}_{xh}$ vary from stratum to stratum, but should not
be used when strata sample sizes are small because each ratio is biased, and the bias
can propagate through the strata. Note that poststratification (Section 4.4) is a special
case of the separate ratio estimator.

The combined estimator has less bias when the sample sizes in some of the strata
are small. When the ratios vary greatly from stratum to stratum, however, the combined
estimator does not take advantage of the extra efficiency afforded by stratification as
does the separate ratio estimator. Many survey software packages, including SAS,
calculate the combined ratio estimator by default.


EXAMPLE 4.10

Steffey et al. (2006) describe the use of combined ratio estimation in the legal case
Labor Ready v. Gates McDonald. The plaintiff alleged that the defendant had not 
thoroughly investigated claims for worker’s compensation, resulting in overpayments 
for these claims by the plaintiff. A total of N = 940 claims were considered in 1997. 
For each of these, the incurred cost of the claim ($x_i$) was known, and consequently the
total amount of incurred costs was known to be $t_x = \$9.407$ million. But the plaintiff
contended that the incurred value amounts were unjustified, and that the assessed
value ($y_i$) of some claims after a thorough review would differ from the incurred
value. 

A sampling plan was devised for estimating the total assessed value of all 940 
claims. Since it was expected that the assessed value would be highly correlated 
with the incurred costs, ratio estimation is desirable here. Two strata were sampled: 
Stratum 1 consisted of the claims in which the incurred cost exceeded $25,000, and 
stratum 2 consisted of the smaller claims (incurred cost less than $25,000). Summary 
statistics for the strata are given in the following table, with $r_h$ the sample correlation
in stratum $h$:


Stratum   $N_h$   $n_h$   $\bar{x}_h$    $s_{xh}$       $\bar{y}_h$    $s_{yh}$       $r_h$

1         102     70      $59,549.55     $64,047.95     $38,247.80     $32,470.78     0.62
2         838     101     $5,718.84      $5,982.34      $3,833.16      $5,169.72      0.77


The sampling fraction was set much higher in stratum 1 than in stratum 2 because
the variability is much higher in stratum 1 (the investigators used a modified form
of the optimal allocation described in Section 3.4.2). We estimate

\[ \hat{t}_{x,\text{str}} = \sum_{h=1}^{2} \hat{t}_{xh} = (102)(59,549.55) + (838)(5,718.84) = 10,866,442.02, \]

\[ \hat{t}_{y,\text{str}} = \sum_{h=1}^{2} \hat{t}_{yh} = (102)(38,247.80) + (838)(3,833.16) = 7,113,463.68, \]

and

\[ \hat{B} = \frac{\hat{t}_{y,\text{str}}}{\hat{t}_{x,\text{str}}} = \frac{7,113,463.68}{10,866,442.02} = 0.654626755. \]


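Using formulas for variances of stratified samples,

\[ \hat{V}(\hat{t}_{x,\text{str}}) = \left(1 - \frac{70}{102}\right) (102)^2 \frac{(64,047.95)^2}{70} + \left(1 - \frac{101}{838}\right) (838)^2 \frac{(5982.34)^2}{101} = 410,119,750,555, \]

\[ \hat{V}(\hat{t}_{y,\text{str}}) = \left(1 - \frac{70}{102}\right) (102)^2 \frac{(32,470.78)^2}{70} + \left(1 - \frac{101}{838}\right) (838)^2 \frac{(5169.72)^2}{101} = 212,590,045,044, \]

and

\[ \widehat{\text{Cov}}(\hat{t}_{x,\text{str}}, \hat{t}_{y,\text{str}}) = \left(1 - \frac{70}{102}\right) (102)^2 \frac{(32,470.78)(64,047.95)(0.62)}{70} + \left(1 - \frac{101}{838}\right) (838)^2 \frac{(5169.72)(5982.34)(0.77)}{101} = 205,742,464,829. \]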


Using the combined ratio estimator, the total assessed value of the claims is
estimated by

\[ \hat{t}_{yrc} = (9.407 \times 10^6)(0.654626755) = \$6.158 \text{ million} \]

with standard error

\[ \text{SE}(\hat{t}_{yrc}) = \frac{9.407}{10.866} \sqrt{[2.126 + (0.6546)^2 (4.101) - 2(0.6546)(2.057)] \times 10^{11}} = \$0.371 \text{ million}. \]

We use 169 = (number of observations) − (number of strata) degrees of freedom
for the CI. An approximate 95% CI for the total assessed value of the claims is
6.158 ± 1.97(0.371), or between $5.43 and $6.89 million. Note that the CI for $t_y$
does not contain the total incurred value ($t_x$) of $9.407 million. This supported the
plaintiff’s case that the total incurred value was too high. ■
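
The stratified totals, the ratio, the point estimate, and the variance and covariance components of Example 4.10 can be recomputed from the summary table. The sketch below is one way to organize that arithmetic; the dictionary layout and the helper name `scale` are illustrative choices, not part of the original analysis.

```python
# Summary statistics from the table in Example 4.10.
strata = [
    dict(N=102, n=70,  xbar=59549.55, sx=64047.95, ybar=38247.80, sy=32470.78, r=0.62),
    dict(N=838, n=101, xbar=5718.84,  sx=5982.34,  ybar=3833.16,  sy=5169.72,  r=0.77),
]
t_x = 9.407e6  # known total incurred cost

# Stratified estimates of the totals of x and y, and their ratio.
t_x_str = sum(s["N"] * s["xbar"] for s in strata)
t_y_str = sum(s["N"] * s["ybar"] for s in strata)
B_hat = t_y_str / t_x_str

def scale(s):
    """Stratum multiplier (1 - n_h/N_h) * N_h^2 / n_h used in stratified variances."""
    return (1 - s["n"] / s["N"]) * s["N"] ** 2 / s["n"]

v_x = sum(scale(s) * s["sx"] ** 2 for s in strata)
v_y = sum(scale(s) * s["sy"] ** 2 for s in strata)
cov_xy = sum(scale(s) * s["r"] * s["sx"] * s["sy"] for s in strata)  # s_xyh = r_h s_xh s_yh

t_yrc = B_hat * t_x  # combined ratio estimate of the total assessed value
print(round(t_x_str, 2), round(t_y_str, 2), round(B_hat, 9), round(t_yrc))
```

Because the covariance is rebuilt from the rounded correlations $r_h$, small differences in the last digits relative to a calculation from the raw claims data are to be expected.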


4.6 Model-Based Theory for Ratio and Regression Estimation*


In the design-based theory presented in Sections 4.1 and 4.3, the forms of the
estimators $\hat{\bar{y}}_r$ and $\hat{\bar{y}}_{\text{reg}}$ were motivated by regression models. Properties of the
estimators, however, depend only on the sampling design. Thus, we found in (4.8) that

\[ V(\hat{\bar{y}}_r) \approx \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{i=1}^{N} \frac{(y_i - B x_i)^2}{N - 1} \]

for an SRS. This variance approximation is derived from the simple random sampling
formulas in Chapter 2 and does not rely on any assumptions about the model. If the
model does not fit the data well, ratio or regression estimation might not increase
precision for estimated means and totals, but CIs for the means or totals will be correct
in the sense that a 95% CI will have coverage probability close to 0.95. Inferences about
finite population quantities using ratio or regression estimation are correct even if the
model does not fit the data well. For that reason, the ratio and regression estimators
presented in Sections 4.1 and 4.3 are examples of model-assisted estimators: a model
motivates the form of the estimator, but inference depends on the sampling design.
Särndal et al. (1992) present the theory of model-assisted estimation, in which inference
is based on randomization theory.

If you have studied regression analysis, you learned a different approach to model- 
fitting in which you make assumptions about the regression model, find the least 
squares estimators of the regression parameters under the model, and plot residuals 
and explore regression diagnostics to check how well the model fits the data. Such a 
model-based approach, pioneered by Brewer (1963) and Royall (1970), can also be 
followed with survey data. As in the model-based approach outlined in Section 2.9, 
the model is used to predict population values that are not in the sample. In this 
section we discuss models that give the point estimators in (4.2) and (4.15) for ratio 
and regression estimation. The variances under a model-based approach, however, 
are different, as we will see. 


4.6.1 A Model for Ratio Estimation


We stated earlier that ratio estimation works well in an SRS when a straight line 
through the origin fits well and when the variance of the observations about the line is 
proportional to $x$. We can state these conditions as a linear regression model: Assume
that $x_1, x_2, \ldots, x_N$ are known (and all are greater than zero) and that $Y_1, Y_2, \ldots, Y_N$
are independent and follow the model

\[ Y_i = \beta x_i + \varepsilon_i, \qquad (4.23) \]

where $E_M(\varepsilon_i) = 0$ and $V_M(\varepsilon_i) = \sigma^2 x_i$. The independence of observations in the
model is an explicit statement that the sampling design gives no information that can
be used in estimating quantities of interest; the sampling procedure has no effect on
the validity of the model. Under the model, $T_y = \sum_{i=1}^{N} Y_i$ is a random variable and
the population total of interest, $t_y$, is one realization of the random variable $T_y$ (this is
in contrast to the randomization approach, in which $t_y$ is considered to be a fixed but
unknown quantity and the only random variables are the sample indicators $Z_i$). If $S$
represents the set of units in our sample, then

\[ t_y = \sum_{i \in S} y_i + \sum_{i \notin S} y_i. \]

We observe the values of $y_i$ for units in the sample, and predict those for units not in
the sample as $\hat{\beta} x_i$, where $\hat{\beta} = \bar{y}/\bar{x}$ is the weighted least squares estimate of $\beta$ under
the model in (4.23) (see Exercise 32). Then a natural estimate of $t_y$ is

\[ \hat{T}_y = \sum_{i \in S} y_i + \hat{\beta} \sum_{i \notin S} x_i = n\bar{y} + \frac{\bar{y}}{\bar{x}} \sum_{i \notin S} x_i = \frac{\bar{y}}{\bar{x}} \left( n\bar{x} + \sum_{i \notin S} x_i \right) = \frac{\bar{y}}{\bar{x}} t_x. \]

This model results in the ratio estimator of $t_y$ given in Section 4.1. Indeed, the finite
population ratio $B = t_y/t_x$ is the weighted least squares estimator of $\beta$, applied to the
entire population.
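
A quick numerical check makes the weighted least squares claim concrete. For a line through the origin with weights $w_i$, the weighted least squares slope is $\sum w_i x_i y_i / \sum w_i x_i^2$; with $w_i = 1/x_i$ (the inverse of the model variance, up to $\sigma^2$) this reduces to $\sum y_i / \sum x_i = \bar{y}/\bar{x}$. The sketch below assumes nothing beyond that formula; the data values are made up.

```python
def wls_slope_through_origin(x, y, w):
    """Weighted least squares slope for the no-intercept model y = beta * x."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

# Hypothetical sample values (all x positive, as the model requires).
x = [2.0, 4.0, 5.0, 9.0]
y = [3.0, 5.0, 8.0, 12.0]

weights = [1.0 / xi for xi in x]            # weights proportional to 1 / V_M(Y_i)
beta_wls = wls_slope_through_origin(x, y, weights)
beta_ratio = (sum(y) / len(y)) / (sum(x) / len(x))   # ybar / xbar
print(beta_wls, beta_ratio)
```

Other weights give other slopes (equal weights give $\sum x_i y_i / \sum x_i^2$), so the ratio estimator corresponds specifically to the variance structure $V_M(\varepsilon_i) = \sigma^2 x_i$.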

In many common sampling schemes, we find that if we adopt a model consistent 
with the reasons we would adopt a certain sampling scheme or method of estimation, 
the point estimators obtained using the model are very close to the design-based 
estimators. The model-based variance, though, usually differs from the variance from 
the randomization theory. In randomization theory, or design-based sampling, the 
sampling design determines how sampling variability is estimated. In model-based 
sampling, the model determines how variability is estimated, and the sampling design 
is irrelevant: as long as the model holds, you could choose any $n$ units you want
from the population.

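The model-based estimator

\[ \hat{T}_y = \sum_{i \in S} y_i + \hat{\beta} \sum_{i \notin S} x_i \]

is unbiased under the assumed model in (4.23) since

\[ E_M[\hat{T}_y - T_y] = E_M\left[ \hat{\beta} \sum_{i \notin S} x_i - \sum_{i \notin S} Y_i \right] = 0. \]

The model-based variance is

\[ V_M[\hat{T}_y - T_y] = V_M\left[ \hat{\beta} \sum_{i \notin S} x_i - \sum_{i \notin S} Y_i \right] = V_M\left[ \hat{\beta} \sum_{i \notin S} x_i \right] + V_M\left[ \sum_{i \notin S} Y_i \right] \]

because $\hat{\beta}$ and $\sum_{i \notin S} Y_i$ are independent under the model assumptions. The model in
(4.23) does not depend on which population units are selected to be the sample $S$, so
$S$ can be treated as though it is fixed. Consequently, using (4.23),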


\[ V_M\left[ \sum_{i \notin S} Y_i \right] = V_M\left[ \sum_{i \notin S} (\beta x_i + \varepsilon_i) \right] = V_M\left[ \sum_{i \notin S} \varepsilon_i \right] = \sigma^2 \sum_{i \notin S} x_i, \]

and, similarly,

\[ V_M\left[ \hat{\beta} \sum_{i \notin S} x_i \right] = \left( \sum_{i \notin S} x_i \right)^2 V_M\left[ \frac{\sum_{i \in S} Y_i}{\sum_{i \in S} x_i} \right] = \left( \sum_{i \notin S} x_i \right)^2 \frac{\sigma^2}{\sum_{i \in S} x_i}. \]

Combining the two terms,

\[ V_M[\hat{T}_y - T_y] = \sigma^2 \sum_{i \notin S} x_i + \left( \sum_{i \notin S} x_i \right)^2 \frac{\sigma^2}{\sum_{i \in S} x_i}
 = \sigma^2 t_x \frac{\sum_{i \notin S} x_i}{\sum_{i \in S} x_i}
 = \left( 1 - \frac{\sum_{i \in S} x_i}{t_x} \right) \frac{\sigma^2 t_x^2}{\sum_{i \in S} x_i}. \qquad (4.24) \]

Note that if the sample size is small relative to the population size, then

\[ V_M[\hat{T}_y - T_y] \approx \frac{\sigma^2 t_x^2}{\sum_{i \in S} x_i}; \]

the quantity $(1 - \sum_{i \in S} x_i / t_x)$ serves as an fpc in the model-based approach to ratio
estimation.


EXAMPLE 4.11

Let’s perform a model-based analysis of the data from the Census of Agriculture, used
in Examples 4.2 and 4.3. We already plotted the data in Figure 4.1, and it looked as
though a straight line through the origin would fit well, and that the variability about
the line was greater for observations with larger values of $x$. For the data points with $x$
positive, we can run a regression analysis with no intercept and with weight variable
$1/x$. SAS PROC REG code used for this analysis is provided on the website. Only
299 observations are used in this analysis since observation 179, Hudson County,
New Jersey, has $x_{179} = 0$.


Dependent Variable: acres92
NOTE: No intercept in model. R-Square is redefined.
Weight: recacr87

                          Analysis of Variance

                                   Sum of          Mean
Source                 DF         Squares        Square     F Value    Pr > F
Model                   1        88168461      88168461     41487.3    <.0001
Error                 298          633307    2125.19126
Uncorrected Total     299        88801768

Root MSE            46.09980     R-Square     0.9929
Dependent Mean         38097     Adj R-Sq     0.9928
Coeff Var            0.12101

                          Parameter Estimates

                      Parameter     Standard
Variable       DF      Estimate        Error     t Value     Pr > |t|
acres87         1       0.98657      0.00484      203.68       <.0001

The slope, 0.986565, and the model-based estimate of the total, $9.5151 \times 10^8$, are
the same as the design-based estimates obtained in Example 4.2. The model-based
standard error of the estimated total, using (4.24), is

\[ \sqrt{ \hat{\sigma}^2 \, t_x \, \frac{\sum_{i \notin S} x_i}{\sum_{i \in S} x_i} }. \]

We can use the weighted residuals (for nonzero $x_i$)

\[ r_i = \frac{y_i - \hat{\beta} x_i}{\sqrt{x_i}} \]

to estimate $\sigma^2$: If the model assumptions hold, $\hat{\sigma}^2 = \sum_{i \in S} r_i^2 / (n - 1)$ (given as the MSE
in the ANOVA table) estimates $\sigma^2$. Thus

\[ \text{SE}_M[\hat{T}_y] = \sqrt{ (2125.19126) \left( \frac{964,470,625 - 90,586,117}{90,586,117} \right) (964,470,625) } = 4,446,719. \]

Note that for this example, the model-based standard error is smaller than the
standard error we calculated using randomization inference, which was 5,344,568.
The model-based analysis assumes that $V_M(\varepsilon_i) = \sigma^2 x_i$; the design-based analysis
does not require such an assumption. ■
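
The standard error in Example 4.11 follows directly from (4.24) once $\hat{\sigma}^2$, $t_x$, and $\sum_{i \in S} x_i$ are known. A short sketch of that calculation, using the three numbers quoted in the example (the function name is an illustrative choice):

```python
import math

def model_based_se_ratio(sigma2_hat, t_x, sum_x_sample):
    """Model-based SE of the ratio estimator of a total, from (4.24)."""
    fpc = 1 - sum_x_sample / t_x                       # model-based analogue of the fpc
    return math.sqrt(fpc * sigma2_hat * t_x ** 2 / sum_x_sample)

se_model = model_based_se_ratio(
    sigma2_hat=2125.19126,       # MSE from the weighted no-intercept regression
    t_x=964_470_625,             # population total of acres87
    sum_x_sample=90_586_117,     # sample total of acres87
)
print(round(se_model))
```

Setting the fpc term to 1 in this function gives the large-population approximation $\hat{\sigma}^2 t_x^2 / \sum_{i \in S} x_i$ noted after (4.24).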


When adopting a model for a set of data, we need to check the assumptions of the 
model. The assumptions for the model used in this section are: 


1 The model is correct, that is, $E_M(Y_i) = \beta x_i$.
2 The variance structure is correct, that is, $V_M(Y_i) = \sigma^2 x_i$.
3 The observations are independent.


Typically, assumptions 1 and 2 are checked by plotting the data and examining
residuals from the model. Assumption 3, however, is difficult to check in practice, and 
requires knowledge of how the data were collected. Generally, if you take a random 
sample, then you may assume the observations are independent. 

We can perform some checks on the appropriateness of a model with straight line
through the origin for these data: If the variance of $y_i$ about the line is proportional to
$x_i$, then a plot of the weighted residuals

\[ \frac{y_i - \hat{\beta} x_i}{\sqrt{x_i}} \]

against $x_i$ or $\log x_i$ should not exhibit any patterns. This plot is given for the agriculture
census data in Figure 4.6; nothing appears in the plot to make us doubt the adequacy
of this model for the observations in our sample.

FIGURE 4.6
The plot of weighted residuals vs. $x_i$, for the random sample from the agricultural census.
A few counties may be outliers; overall, though, scatter appears to be fairly random.
[Horizontal axis: Millions of Acres Devoted to Farms (1987)]


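4.6.2 A Model for Regression Estimation

A similar result occurs for regression estimation; for that, the model is

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \]

where the $\varepsilon_i$ are independent and identically distributed with mean 0 and constant
variance $\sigma^2$. The least squares estimators of $\beta_0$ and $\beta_1$ in this model are

\[ \hat{\beta}_1 = \frac{\sum_{i \in S} (x_i - \bar{x}_S)(y_i - \bar{y}_S)}{\sum_{i \in S} (x_i - \bar{x}_S)^2} \]

and

\[ \hat{\beta}_0 = \bar{y}_S - \hat{\beta}_1 \bar{x}_S. \]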


Then, using the predicted values in place of the units not sampled,

\[ \hat{T}_y = \sum_{i \in S} y_i + \sum_{i \notin S} (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]
\[ = n \bar{y}_S + \sum_{i \notin S} (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]
\[ = n (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_S) + \sum_{i \notin S} (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]
\[ = \sum_{i=1}^{N} (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]
\[ = N (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_U). \]

The regression estimator of $t_y$ is thus $N \times$ (predicted value under the model at $\bar{x}_U$).

In practice, if the sample size is small relative to the population size and we have
an SRS, we can simply ignore the fpc and use the standard error for estimating the
mean value of a response. From regression theory (see one of the regression books
listed in the references for Chapter 11), the variance of $(\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_U)$ is

\[ \sigma^2 \left[ \frac{1}{n} + \frac{(\bar{x}_U - \bar{x}_S)^2}{\sum_{i \in S} (x_i - \bar{x}_S)^2} \right]. \]

Thus if $n/N$ is small,

\[ V_M[\hat{T}_y - T_y] \approx N^2 \sigma^2 \left[ \frac{1}{n} + \frac{(\bar{x}_U - \bar{x}_S)^2}{\sum_{i \in S} (x_i - \bar{x}_S)^2} \right]. \qquad (4.25) \]


EXAMPLE 4.12

In Example 4.9, the predicted value from the model when $x = 11.3$ is the regression
estimator for $\bar{y}_U$. The predicted value is $(5.05929 + 0.61327 \times 11.3) = 11.9893$. The
model-based standard error is obtained from (4.25):

\[ \text{SE}_M(\hat{\bar{y}}_{\text{reg}}) = \sqrt{ \hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(\bar{x}_U - \bar{x}_S)^2}{\sum_{i \in S} (x_i - \bar{x}_S)^2} \right] } = \sqrt{ 5.79 \left[ \frac{1}{25} + \frac{(11.3 - 10.6)^2}{226.006} \right] } = 0.494. \]

These values can be calculated directly using the SAS PROC REG code on the website.
The standard error from (4.25) does not incorporate the fpc: Exercise 34 examines
the fpc in model-based regression. ■
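
The model-based regression calculation can be sketched as a small function: fit the least squares line to the sample, predict at $\bar{x}_U$, and apply (4.25) divided by $N^2$ to obtain a standard error for the estimated mean. This is an illustration only, assuming $\sigma^2$ is estimated by the residual mean square with $n - 2$ degrees of freedom and that the fpc is ignored; the function name and the five data points are hypothetical.

```python
import math

def model_based_regression_mean(x, y, xbar_U):
    """Predicted mean at xbar_U and its model-based SE (no fpc), per (4.25) / N^2."""
    n = len(x)
    xbar_S = sum(x) / n
    ybar_S = sum(y) / n
    sxx = sum((xi - xbar_S) ** 2 for xi in x)
    sxy = sum((xi - xbar_S) * (yi - ybar_S) for xi, yi in zip(x, y))
    b1 = sxy / sxx                                  # least squares slope
    b0 = ybar_S - b1 * xbar_S                       # least squares intercept
    pred = b0 + b1 * xbar_U                         # regression estimate of the mean
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = sse / (n - 2)                      # residual mean square
    se = math.sqrt(sigma2_hat * (1 / n + (xbar_U - xbar_S) ** 2 / sxx))
    return pred, se

# Hypothetical sample; xbar_U = 3.5 is the assumed known population mean of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
pred, se = model_based_regression_mean(x, y, xbar_U=3.5)
print(round(pred, 3), round(se, 4))
```

Multiplying both returned values by $N$ gives the corresponding estimate and standard error for the population total.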
4.6.3 Differences Between Model-Based and Design-Based Estimators


Under the ratio model, the point estimator for the population total is the same as in 
the design-based approach, but the variance differs from that for the design-based 
estimator. Why aren’t the standard errors the same as in randomization theory? That 
is, how can we have two different variances for the same estimator? The discrepancy is 
due to the different definitions of variance: In design-based sampling, the variance 
is the average squared deviation of the estimate from its expected value, averaged 
over all samples that could be obtained using a given design. If we are using a model, 
the variance is again the average squared deviation of the estimate from its expected 
value, but here the average is over all possible samples that could be generated from 
the population model. 

The model-based estimator uses a prediction approach, in which the values of $y_i$
not in the sample are predicted using the model. We have

\[ \hat{T}_y = \sum_{i \in S} y_i + \sum_{i \notin S} \hat{y}_i = \sum_{i=1}^{N} \hat{y}_i + \sum_{i \in S} (y_i - \hat{y}_i). \]


If you were absolutely certain that your model was correct, you could minimize the 
model-based variance of the regression estimator by including only the members of 
the population with the largest and smallest values of x to be in the sample, and 
excluding units with values of x between those extremes. No one would recommend 
such a design in practice, of course, because one never has that much assurance in 
a model. However, nothing in the model says that you should take an SRS or any 
other type of probability sample, or that the sample needs to be representative of the 
population—as long as the model is correct. 

What if the model is wrong? The model-based estimates are only model-unbiased— 
that is, they are unbiased only within the structure of that particular model. If the model 
is wrong, the model-based estimators will be biased, but, from within the model, we 
will not necessarily be able to tell how big the bias is. Thus, if the model is wrong, 
the model-based estimator of the variance generally underestimates the MSE. When 
using model-based inference in sampling, then, you need to be very careful to check 
the assumptions of the model by examining residuals and using other diagnostic tools. 
The assumption of independence is typically the most difficult to check. You can (and 
should!) perform diagnostics to check some of the assumptions of the model for the 
sampled data, but need to realize that you are making a strong, untestable assumption 
that the model applies to population units you did not observe. 

The randomization-based estimator of the MSE may be used whether any given 
model fits the data or not, because randomization inference depends only on how 
the sample was selected. But even the most die-hard randomization theorist relies on 
models for nonresponse, and for designing the survey. Hansen et al. (1983) point out 
that generally randomization theory samplers have a model in mind when designing 
the survey and take that model into account to improve efficiency. 

We will return to this issue in Chapter 11. 


4.7 Chapter Summary


Ratio and regression estimation use an auxiliary variable that is highly correlated with 
the variable of interest to reduce the MSE of estimated population means or totals. 
We “know” that $y$ is correlated with $x$, and we know how far $\bar{x}$ is from $\bar{x}_U$, so we use
this information to adjust $\bar{y}$ and (we hope) increase the precision of our estimate. The
estimators in ratio and regression estimation come from models that we hope describe
the data, but the randomization-theory properties of the estimators do not depend on
these models.

As will be seen in Chapter 11, the ratio and regression estimators discussed in this
chapter are special cases of a generalized regression estimator. All three estimators of
the population total discussed so far ($\hat{t}_y$, $\hat{t}_{yr}$, and $\hat{t}_{y\text{reg}}$) can be expressed in terms of
regression coefficients. For an SRS of size $n$, the estimators are given in the following
table. For each, the estimated variance depends on $s_e^2$, the sample variance of the $e_i$.


              Estimator                                          $e_i$

SRS           $N\bar{y}$                                         $y_i - \bar{y}$

Ratio         $t_x (\bar{y}/\bar{x})$                            $y_i - \hat{B} x_i$

Regression    $N[\bar{y} + \hat{B}_1 (\bar{x}_U - \bar{x})]$     $y_i - \hat{B}_0 - \hat{B}_1 x_i$


In an SRS, ratio or regression estimators give greater precision than $\hat{t}_y$ when
$\sum_{i \in S} e_i^2$ for the method is smaller than $\sum_{i \in S} (y_i - \bar{y})^2$. Ratio estimation is especially
useful in cluster sampling, as we shall see in Chapters 5 and 6.

We often want to find estimates for subpopulations of interest, for example, different
age groups. If the sampling frame contains information on the age group for units
in the population, a stratified sample can be designed as discussed in Chapter 3. If the
sampling frame does not contain this information but we know the population sizes
of the subpopulations, then the subpopulations are poststrata and we can estimate the
population total in group $h$ by $N_h \bar{y}_h$, where $\bar{y}_h$ is the mean of the sampled observations
in the subpopulation. If we do not know the population sizes of the subpopulations,
then they are domains and we estimate the population total in group $h$ by $N(n_h/n)\bar{y}_h$.
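
The two estimators of a group total in the preceding paragraph differ only in whether the known $N_h$ or the estimate $N(n_h/n)$ multiplies the group mean. A minimal sketch with made-up numbers (the function names and all values are hypothetical):

```python
def poststratum_total(N_h, ybar_h):
    """Group total when the population size N_h of the group is known."""
    return N_h * ybar_h

def domain_total(N, n, n_h, ybar_h):
    """Group total when N_h is unknown and is estimated by N * n_h / n."""
    return N * (n_h / n) * ybar_h

# Hypothetical SRS of n = 50 from N = 1000; 20 sampled units fall in the group,
# with group mean 30.0. Suppose the true group size is known to be 350.
post = poststratum_total(N_h=350, ybar_h=30.0)
dom = domain_total(N=1000, n=50, n_h=20, ybar_h=30.0)
print(post, dom)
```

Here the sample happens to overrepresent the group (40% of the sample versus 35% of the population), so the domain estimator is larger; using the known $N_h$ removes that source of sampling variability.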

In this chapter, we discussed ratio and regression estimation using just one aux- 
iliary variable x. In practice, you may want to use several auxiliary variables. The 
principles for using multiple regression models will be the same; we shall present the 
theory for general surveys in Section 11.7. 


Key Terms 


Calibration: A procedure in which weights are adjusted so that estimated popula- 
tion totals of auxiliary variables coincide with the actual population totals of those 
variables. 


Domain: A subpopulation for which estimates are desired. The domain sample sizes 
are generally random variables. 


Poststratification: A form of ratio estimation in which sampled units are divided 
into subgroups based on characteristics measured in the sample; the population size 
of each subgroup is assumed known. 


Ratio estimator: An estimator of the population mean or total based on a ratio with
an auxiliary quantity for which the population mean or total is known.


Regression estimator: An estimator of the population mean or total based on a
regression model using an auxiliary quantity for which the population mean or total
is known.


For Further Reading


Raj (1968) and Cochran (1977) have good treatments of ratio and regression estimation
in SRSs. For regression models in a general framework, discussed in this book
in Chapter 11, see Särndal et al. (1992). The overview paper by Särndal (2007)
summarizes the use of ratio and regression estimation for calibrating survey estimates to
known population totals.

The books by Thompson (1997), Brewer (2002) and Valliant et al. (2000)
describe differences between design-based and model-based approaches to survey
inference. The articles by Hansen et al. (1983), Rao (1997), Lohr (2001), and Little
(2004) discuss the relative merits of design- and model-based approaches to
inference.


4.8 Exercises


A. Introductory Exercises
1 For each of the following situations, indicate how you might use ratio or regression
estimation.


a Estimate the proportion of time in television news broadcasts in your city that is
devoted to sports.


b Estimate the average number of fish caught per hour for anglers visiting a lake in
August.


c Estimate the average amount that undergraduate students at your university spent
on textbooks in fall semester.


d Estimate the total weight of usable meat (discarding bones, fat, and skin) in a
shipment of chickens.


2 Consider the hypothetical population below, with population values:


Unit Number x y 


CONIDMNRWNH 
aN 
oo 
a Find the values of the population quantities $t_x$, $t_y$, $S_x$, $S_y$, $R$, and $B$.


b Construct a table like that in Table 4.2, giving the sampling distribution of $N\bar{y}$ and
of $\hat{t}_{yr}$ for a sample of size $n = 3$.


c Draw a histogram of the sampling distribution of $\hat{t}_{yr}$. Compare this histogram to
a histogram of the sampling distribution of $N\bar{y}$.


d Find the mean and variance of the sampling distribution of $\hat{t}_{yr}$. How do these
compare to the mean and variance of $N\bar{y}$? What is the bias of $\hat{t}_{yr}$?


e Use Equation (4.6), together with the population quantities you calculated in (a) to
find an approximation to Bias$(\hat{t}_{yr}) = N\,$Bias$(\hat{\bar{y}}_r)$. How close is the approximation
to the true bias in (d)?


3 Foresters want to estimate the average age of trees in a stand. Determining age is
cumbersome, because one needs to count the tree rings on a core taken from the
tree. In general, though, the older the tree, the larger the diameter, and diameter
is easy to measure. The foresters measure the diameter of all 1132 trees and find
that the population mean equals 10.3. They then randomly select 20 trees for age
measurement.


Tree No.   Diameter, x   Age, y      Tree No.   Diameter, x   Age, y
 1         12.0          125         11          5.7           61
 2         11.4          119         12          8.0           80
 3         13             83         13         10.3          114
 4          9.0           85         14         12.0          147
 5         10.5           99         15          9.2          122
 6          7.9          117         16          8.5          106
 7          1.3           69         17          7.0           82
 8         10.2          133         18         10.7           88
 9         11.7          154         19          9.3           97
10         11.3          168         20          8.2           99


a Draw a scatterplot of y vs. x.


b Estimate the population mean age of trees in the stand using ratio estimation and
give an approximate standard error for your estimate.


c Repeat (b) using regression estimation.


d Label your estimates on your graph. How do they compare?


B. Working with Survey Data


4 Use the data in ssc.dat, described in Exercise 14 of Chapter 2, for this problem.


a Estimate the proportion of female members who are in academia. Note that this
is a domain mean, with $x_i = 1$ if person $i$ is female and 0 otherwise, and $y_i = 1$
if person $i$ is female and in academia and 0 otherwise. Give a 95% CI.


b Estimate the total number of female members in academia, along with a 95% CI.


5 Use the data in file golfsrs.dat for this problem. Using the 18-hole courses only,
estimate the average greens fee on a weekend to play 18 holes, along with its standard
error.


6 For the 18-hole courses in file golfsrs.dat, plot the weekend 18-hole greens fee vs. the
backtee yardage. Estimate the regression parameters for predicting weekend greens
fees from backtee yardage. Is there a strong relationship between the two variables?
Use regression estimation to estimate the weekend 18-hole greens fee with its standard
error.


7 Use the data in file golfsrs.dat for this problem.

a Estimate the mean weekday greens fee to play 9 holes, for courses with a golf
professional available.

b Now estimate the mean weekday greens fee to play 9 holes, for courses without
a golf professional.


8 The data set agsrs.dat also contains information on the number of farms in 1987 for
the SRS of n = 300 counties from the population of the N = 3078 counties in the
United States (see Example 2.5). In 1987, the United States had a total of 2,087,759
farms.


a Plot the data.


b Use ratio estimation to estimate the total number of acres devoted to farming in
1992, using the number of farms in 1987 as the auxiliary variable.


c Repeat (b), using regression estimation.


d Which method gives the most precision: ratio estimation with auxiliary variable
acres87, ratio estimation with auxiliary variable farms87, or regression estimation
with auxiliary variable farms87? Why?


9 Using the data set agsrs.dat, estimate the total number of acres devoted to farming for
each of two domains: (a) counties with fewer than 600 farms, and (b) counties with
600 or more farms. Give standard errors for your estimates.


10 The data set cherry.dat, from Hand et al. (1994), contains measurements of diameter
(inches), height (feet), and timber volume (cubic feet) for a sample of 31 black cherry
trees. Diameter and height of trees are easily measured, but volume is more difficult
to measure.

a Plot volume vs. diameter for the 31 trees.


b Suppose that these trees are an SRS from a forest of N = 2967 trees and that the
sum of the diameters for all trees in the forest is $t_x$ = 41,835 inches. Use ratio
estimation to estimate the total volume for all trees in the forest. Give a 95% CI.


c Use regression estimation to estimate the total volume for all trees in the forest.
Give a 95% CI.


The data file counties.dat contains information on land area, population, number of 
physicians, unemployment, and a number of other quantities for an SRS of 100 of the 
3141 counties in the United States (U.S. Census Bureau, 1994). The total land area 
for the United States is 3,536,278 square miles; 1993 population was estimated to be 
255,077,536. 


a_ Draw a histogram of the number of physicians for the 100 counties. 


b Estimate the total number of physicians in the United States, along with its stan- 
dard error, using Ny. 


12 
13 
14 


15 
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ec Plot the number of physicians vs. population for each county. Which method 
do you think is more appropriate for these data: ratio estimation or regression 
estimation? 


d Using the method you chose in (c), use the auxiliary variable population to estimate 
the total number of physicians in the United States, along with the standard error. 


e The “true” value for total number of physicians in the population is 532,638. 
Which method of estimation came closer? 


Repeat parts (a)-(d) of Exercise 11 with y = farm population and x = land area. 


Repeat parts (a)-(d) of Exercise 11 with y = number of veterans and x = population. 


(Model-based analysis; requires material in Section 4.6.) Refer to the situation in 
Exercise 11. Use a model-based analysis to estimate the total number of physicians in 
the United States. Which model did you choose, and why? What are the assumptions 
for the model? Do you think they are met? Be sure to examine the residual plots for 
evidence of the inadequacy of the model. How do your results differ from those you 
obtained in Exercise 11? 


Jackson et al. (1987) compared the precision of systematic and stratified sampling 
for estimating the average concentration of lead and copper in the soil. The 1-km² 
area was divided into 100-m squares, and a soil sample was collected at each of the 
resulting 121 grid intersections. Summary statistics from this systematic sample are 
given below. 


Element   n     Average (mg kg⁻¹)   Range (mg kg⁻¹)   Standard Deviation (mg kg⁻¹)
Lead      121   127                 22-942            146
Copper    121   35                  15-90             16


The investigators also poststratified the same region. Stratum A consisted of farm- 
land away from roads, villages, and woodlands. Stratum B contained areas within 50m 
of roads, and was expected to have larger concentrations of lead. Stratum C contained 
the woodlands, which were also expected to have larger concentrations of lead because 
the foliage would capture airborne particles. The data on concentration of lead and 
copper were not used in determining the strata. The data from the grid points falling 
in each stratum are in the following table: 


Element   Stratum   $n_h$   Average (mg kg⁻¹)   Range (mg kg⁻¹)   Standard Deviation (mg kg⁻¹)
Lead      A         82      71                  22-201            28
Lead      B         31      259                 36-942            232
Lead      C         8       189                 88-308            79
Copper    A         82      28                  15-68             9
Copper    B         31      50                  22-90             18
Copper    C         8       45                  31-69             13

a Calculate a 95% CI for the average concentration of lead in the area, using the 
systematic sample. (You may assume that this sample behaves like an SRS.) 
Repeat for the average concentration of copper. 


b Now use the poststratified sample, and find 95% CIs for the average concentration 
of lead and copper. How do these compare with the CIs in (a)? Do you think that 
using stratification in future surveys would increase precision? 


Poststratify the sample in data file agsrs.dat into the four census regions given in 
Example 3.2. Estimate the population mean $\bar{y}_U$ using (4.21) and approximate the 
variance using (4.22). How does the approximate 95% CI using poststratification 
compare with that from Example 2.10? 


Using the data in Example 4.10, calculate the separate ratio estimate for the population 
total $t_y$, along with a 95% CI. 


C. Working with Theory 


(Requires probability.) Use covariances derived in Appendix A to show the result in 
(4.8). 


(Requires computing.) In Equation (4.9), we used

$$\hat{V}_1[\hat{\bar{y}}_r] = \left(1 - \frac{n}{N}\right) \left(\frac{\bar{x}_U}{\bar{x}}\right)^2 \frac{s_e^2}{n}$$

to estimate $\left(1 - \frac{n}{N}\right) \frac{S_d^2}{n}$. An alternative estimator that has been proposed is

$$\hat{V}_2[\hat{\bar{y}}_r] = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}.$$

Generate a population of size 1000 from the model $y_i = \beta x_i + \varepsilon_i$, where 
$\varepsilon_i \sim N(0, \sigma^2 x_i)$. Now take 100 different samples, each with n = 50. 
Compare $\hat{V}_1$ and $\hat{V}_2$. If the variability about the line $y = \beta x$ increases 
as x increases, as is the case for the data generated above, then, if $\bar{x} < \bar{x}_U$, 
we would expect

$$\frac{1}{n-1} \sum_{i \in S} (y_i - \hat{B} x_i)^2$$

to be smaller than

$$\frac{1}{N-1} \sum_{i=1}^{N} (y_i - B x_i)^2 = S_d^2.$$

Using $\hat{V}_1$ instead of $\hat{V}_2$ partially compensates for this. See Valliant (2002) 
for a discussion of why $\hat{V}_1$ is preferred to $\hat{V}_2$ from a conditional inference 
perspective. 
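
One possible way to set up the simulation described above is sketched here in Python, using only the standard library. The values of `beta` and `sigma`, the uniform distribution used for x, and the random seed are assumptions made for illustration; the exercise does not specify them. The sketch computes $\hat{V}_1 = (1 - n/N)(\bar{x}_U/\bar{x})^2 s_e^2/n$ and $\hat{V}_2 = (1 - n/N) s_e^2/n$ for each sample and leaves the comparison and interpretation to the reader.

```python
import math
import random

random.seed(2024)  # arbitrary seed so the run is reproducible

# Assumed settings (not given in the exercise): beta, sigma, and the x distribution.
N, n = 1000, 50
beta, sigma = 2.0, 1.0

# Population from y_i = beta * x_i + eps_i with Var(eps_i) = sigma^2 * x_i.
x = [random.uniform(1, 20) for _ in range(N)]
y = [beta * xi + random.gauss(0, sigma * math.sqrt(xi)) for xi in x]
x_bar_U = sum(x) / N


def variance_estimates(sample):
    """Return (sample mean of x, V1_hat, V2_hat) for one SRS given as index list."""
    xs = [x[i] for i in sample]
    ys = [y[i] for i in sample]
    x_bar = sum(xs) / n
    B_hat = sum(ys) / sum(xs)  # ratio estimate
    # Sample variance of the residuals e_i = y_i - B_hat * x_i (their mean is 0).
    s2_e = sum((yi - B_hat * xi) ** 2 for xi, yi in zip(xs, ys)) / (n - 1)
    v2 = (1 - n / N) * s2_e / n
    v1 = (x_bar_U / x_bar) ** 2 * v2
    return x_bar, v1, v2


# 100 simple random samples without replacement, each of size n.
results = [variance_estimates(random.sample(range(N), n)) for _ in range(100)]

for x_bar, v1, v2 in results[:5]:
    print(f"x_bar = {x_bar:6.3f}   V1 = {v1:.5f}   V2 = {v2:.5f}")
```

Plotting $\hat{V}_1$ and $\hat{V}_2$ against $\bar{x}$ for the 100 samples is one way to see how the two estimators respond when the sample mean of x falls above or below $\bar{x}_U$.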


Some books use the formula

$$\hat{V}[\hat{B}] = \left(1 - \frac{n}{N}\right) \frac{1}{n \bar{x}_U^2} \left(s_y^2 - 2\hat{B} r s_x s_y + \hat{B}^2 s_x^2\right),$$

where r is the sample correlation coefficient of x and y for the values in the sample, 
to estimate the variance of a ratio. 


a Show that this formula is algebraically equivalent to (4.10). 
b It often does not work as well as (4.10) in practice, however: If $s_x$ and $s_y$ are large, 
many computer packages will truncate some of the significant digits so that the 
subtraction will be inaccurate. For the data in Example 4.2, calculate the values 
of $s_x^2$, $s_y^2$, $r$, and $\hat{B}$. Use the formula above to calculate the estimated variance of 
$\hat{B}$. Is it exactly the same as the value from (4.10)? 


(Requires probability.) Recall from Section 2.2 that MSE = variance + (Bias)². Using 
(4.6) and other approximations in that section, show that $[E(\hat{\bar{y}}_r - \bar{y}_U)]^2$ is small 
compared to $V(\hat{\bar{y}}_r)$, when n is large. 


(Requires probability.) Prove (4.6). HINT: Use (4.5) and the derivation of the 
covariance of $\bar{x}$ and $\bar{y}$ in (A.10) of Appendix A. 


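Use Equation (4.6) to find the approximate bias of $\hat{t}_{yr}$ and of $\hat{B}$. 


Comparing two domain means in an SRS. Suppose there are two domains, defined by 
indicator variable

$$x_i = \begin{cases} 1 & \text{if unit } i \text{ is in domain 1} \\ 0 & \text{if unit } i \text{ is in domain 2.} \end{cases}$$

Then, letting $u_i = x_i y_i$, the population values of the two domain means are

$$\bar{y}_{U1} = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i} = \frac{t_u}{t_x} = \frac{\bar{u}_U}{\bar{x}_U}$$

and

$$\bar{y}_{U2} = \frac{\sum_{i=1}^{N} (1 - x_i) y_i}{\sum_{i=1}^{N} (1 - x_i)} = \frac{t_y - t_u}{N - t_x} = \frac{\bar{y}_U - \bar{u}_U}{1 - \bar{x}_U}.$$

If an SRS of size n is taken from a population of size N, the population domain means 
may be estimated by

$$\bar{y}_1 = \frac{\bar{u}}{\bar{x}} \quad \text{and} \quad \bar{y}_2 = \frac{\bar{y} - \bar{u}}{1 - \bar{x}}.$$

a Use an argument similar to that in the discussion following (4.5) to show that

$$\mathrm{Cov}(\bar{y}_1, \bar{y}_2) \approx \frac{1}{\bar{x}_U (1 - \bar{x}_U)} \mathrm{Cov}\left[\bar{u} - \frac{t_u}{t_x} \bar{x},\; \bar{y} - \bar{u} + \frac{t_y - t_u}{N - t_x} \bar{x}\right].$$

b For an SRS, show using (A.10) that

$$\mathrm{Cov}\left[\bar{u} - \frac{t_u}{t_x} \bar{x},\; \bar{y} - \bar{u} + \frac{t_y - t_u}{N - t_x} \bar{x}\right] = 0.$$
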
[Consequently, since Property 7 of Expected Value in Section A.2 implies that 
$V(\bar{y}_1 - \bar{y}_2) = V(\bar{y}_1) + V(\bar{y}_2) - 2\,\mathrm{Cov}(\bar{y}_1, \bar{y}_2)$, in an SRS 
$V(\bar{y}_1 - \bar{y}_2) \approx V(\bar{y}_1) + V(\bar{y}_2)$ and an approximate 95% CI for 
$\bar{y}_{U1} - \bar{y}_{U2}$ is given by

$$\bar{y}_1 - \bar{y}_2 \pm 1.96 \sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}.$$

Thus, for an SRS, the large-sample CI for the difference of two domain means is the 
same (if we ignore the fpc) as you learned in your introductory statistics class. Note, 
though, that this result holds only for an SRS. For more complex sampling designs 
the covariance of the estimated domain means may be nonzero (see Exercise 21 of 
Chapter 6), so more general methods discussed in Section 11.3 must be used.] 


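(Requires mathematical statistics.) Showing (4.8). Suppose that $n/N \to 0$ as $n \to \infty$, 
so that the fpc can be ignored. The central limit theorem tells us that under regularity 
conditions,

$$\sqrt{n} \begin{bmatrix} \bar{x} - \bar{x}_U \\ \bar{y} - \bar{y}_U \end{bmatrix} \xrightarrow{\mathcal{L}} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} S_x^2 & R S_x S_y \\ R S_x S_y & S_y^2 \end{bmatrix} \right),$$

where $\xrightarrow{\mathcal{L}}$ denotes convergence in distribution. Show that the limiting distribution of 
$\sqrt{n}(\hat{\bar{y}}_r - \bar{y}_U)$ has mean 0 and variance $S_y^2 - 2BRS_xS_y + B^2 S_x^2$. 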


Show that if we consider approximations to the MSE in (4.8) and (4.17) to be accurate, 
then the variance of $\hat{\bar{y}}_r$ from ratio estimation is at least as large as the variance of $\hat{\bar{y}}_{reg}$ 
from regression estimation. HINT: Look at $V(\hat{\bar{y}}_r) - V(\hat{\bar{y}}_{reg})$ using the formulas in (4.8) 
and (4.17), and show that the difference is non-negative. 

Prove (4.16). 

(Requires probability.) Prove (4.18). 


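(Requires probability.) Let $d_i = y_i - [\bar{y}_U + B_1(x_i - \bar{x}_U)]$. Show that for regression 
estimation,

$$E[\hat{\bar{y}}_{reg} - \bar{y}_U] \approx -\left(1 - \frac{n}{N}\right) \frac{1}{n S_x^2} \sum_{i=1}^{N} \frac{d_i (x_i - \bar{x}_U)^2}{N - 1}.$$

As in Exercise 21, show that $(E[\hat{\bar{y}}_{reg} - \bar{y}_U])^2$ is small compared to $\mathrm{MSE}[\hat{\bar{y}}_{reg}]$, when n 
is large. 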

(Requires probability.) Consider the combined ratio estimator of the population total, 
$\hat{t}_{yrc}$, from Section 4.5. 

a Show that

$$\frac{\left|\mathrm{Bias}[\hat{t}_{yrc}]\right|}{\sqrt{V(\hat{t}_{yrc})}} \le \mathrm{CV}(\hat{t}_x).$$

Hint: See (4.4). 


b In a stratified random sample, find the approximate bias and MSE of $\hat{t}_{yrc}$. 


(Requires probability.) Consider the separate ratio estimator of the population total, 
$\hat{t}_{yrs}$, from Section 4.5. Find the bias and an approximation to the MSE of $\hat{t}_{yrs}$ in a 
stratified random sample. Allow different ratios, $B_h$, in each stratum. When will the 
bias be small? 


(Requires linear model theory.) Suppose we have a stochastic model

$$Y_i = \beta x_i + \varepsilon_i,$$

where the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2 x_i$. Show that the weighted 
least squares estimator of $\beta$ is $\hat{\beta} = \bar{y}/\bar{x}$. Is the standard error for $\hat{\beta}$ that comes from 
weighted least squares the same as that for $\hat{B}$ in (4.10)? 


(Requires linear model theory.) Suppose that the model in (4.23) misspecifies the 
variance structure and that a better model has $V_M[\varepsilon_i] = \sigma^2$. 

a What is the weighted least squares estimator of $\beta$ if $V_M[\varepsilon_i] = \sigma^2$? What is the 
corresponding estimator of the population total for y? 

b Derive $V_M[\hat{T}_y - T_y]$. 

c Apply your estimators to the data in agsrs.dat. How do these estimates compare 
with those in Examples 4.2 and 4.11? 


Equation (4.25) gave the model-based variance for a population total when it is 
assumed that the sample size is small relative to the population size. Derive the 
variance incorporating the finite population correction. 


The quantity $\hat{B}$ used in ratio estimation is sometimes called the ratio-of-means 
estimator. An alternative that has been proposed is the mean-of-ratios estimator: Let 
$b_i = y_i/x_i$ for unit i; then the mean-of-ratios estimator is

$$\bar{b} = \frac{1}{n} \sum_{i \in S} b_i.$$

a Do you think the mean-of-ratios estimator is appropriate for the data in Example 4.5? 
Why, or why not? 

b Show that, for the ratio-of-means estimator $\hat{B}$, $t_x \hat{B} = t_y$ when the entire population 
is sampled (i.e., $S = \mathcal{U}$). 

c Give an example to show that it is possible to have $t_x \bar{b} \ne t_y$ when the entire 
population is sampled. 

d Define

$$S_{bx} = \frac{1}{N-1} \sum_{i=1}^{N} (b_i - \bar{b}_U)(x_i - \bar{x}_U).$$

Show that for an SRS of size n, the bias of $\bar{b}$ as an estimator of B is

$$E[\bar{b} - B] = -\frac{(N - 1) S_{bx}}{t_x}.$$

As a consequence, if $S_{bx} \ne 0$ the bias does not decrease as n increases. 

e (Requires linear model theory.) Show that $\bar{b}$ is the weighted least squares estimator 
of $\beta$ under the model

$$Y_i = \beta x_i + \varepsilon_i$$

when the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2 x_i^2$. 

(Requires computing.) 

a Generate 500 data sets, each with 30 pairs of observations $(x_i, y_i)$. Use a bivariate 
normal distribution with means 0, standard deviations 1, and correlation 0.5 to 
generate each pair $(x_i, y_i)$. For each data set, calculate $\bar{y}$ and $\hat{\bar{y}}_{reg}$, using $\bar{x}_U = 0$. 
Graph a histogram of the 500 values of $\bar{y}$ and another histogram of the 500 values 
of $\hat{\bar{y}}_{reg}$. What do you see? 


b Repeat part (a) for 500 data sets, each with 60 pairs of observations. 


(Requires computing.) Use the population in agpop.dat for this exercise. 

a Take 500 SRSs, each of size n = 40, from the data set. For each SRS, calculate the 
sample mean $\bar{y}$, the poststratified mean $\bar{y}_{post}$ from (4.21), the estimated variance 
of $\bar{y}$, and the estimated variance of $\bar{y}_{post}$ using (4.22). 

b Calculate the sample variance of the 500 values of $\bar{y}$. This gives an estimate of 
the true value of $V(\bar{y})$. Compare your value with the average of the 500 values 
of $\hat{V}(\bar{y})$. Since $\hat{V}(\bar{y})$ is an unbiased estimator of $V(\bar{y})$ for any sample size, these 
values should be close. 


c Calculate the sample variance of the 500 values of $\bar{y}_{post}$. This gives an estimate of 
the true value of $V(\bar{y}_{post})$. Compare your value with the average of the 500 values 
of $\hat{V}(\bar{y}_{post})$. 


D. Projects and Activities 


Find a dictionary of a language you have studied. Choose 30 pages at random from 
the dictionary. For each, record 


x = number of words on the page 
y = number of words that you know on the page (be honest!) 


How many words do you estimate are in the dictionary? How many do you estimate 
that you know? What percentage of the words do you know? Give standard errors for 
all your estimates. 


The 2000 U.S. Presidential election generated controversy for many reasons; one part 
of the controversy was that television networks declared that candidate Gore would 
be the winner based on exit polls, which are surveys of voters as they leave polling 
places. Read the article by Mitofsky and Edelman (2002) on estimation in the 2000 
election (the article is available at www.jos.nu). Describe how ratio estimation was 
used in the polls. What were the likely sources of bias in the 2000 exit polls? 


Online bookstore. Use your sample from Exercise 33 in Chapter 2 for this exercise. 


a Estimate the ratio (average price/average number of pages) and give the standard 
error. 


b Consider two domains: hardcover books and paperback books. Estimate the mean 
price (with standard error) for books in each domain. 


Forest data. Use your sample from Exercise 36 of Chapter 2 for this exercise. 


a Estimate the ratio of hillshade index at 9 am to hillshade index at noon. Include 
a 95% CI. 


b Estimate the average elevation for each of the 7 forest cover types, along with a 
95% CI. 
Trucks. Use the data described in Exercise 34 of Chapter 3 for this exercise. 


a The variable business describes the primary business in which the vehicle was 
used in 2002. Estimate the total miles driven for each type of business in 2002, 
along with a 95% CI. How is this a special case of estimating a domain total? 

b Estimate the average miles per gallon (MPG) for each of the transmission types 
(transmssn), along with a 95% CI. 

c Estimate the ratio of miles driven in 2002 (miles_annl) to lifetime miles driven 
(miles_life), along with a 95% CI. 

Baseball data. 

a Using your SRS from Exercise 32 of Chapter 2, estimate the mean log salary for 
players in each position along with the standard errors. 

b Estimate the ratio (total number of home runs)/(number of runs scored) for the 
population and give a 95% CI. 

IPUMS exercises. 

a Using your SRS from Exercise 37 of Chapter 2, try estimating total income (inctot) 
using ratio estimation with age as an auxiliary variable. Does it decrease the 
standard error? Why, or why not? (Include a plot as part of your answer.) 

b Using one of the following variables: age, sex, race, or marstat, use regression 


estimation to calibrate your estimate of total income to the category totals for the 
variable you chose. 


Cluster Sampling with Equal 
Probabilities 


“But averages aren't real,” objected Milo; “they're just imaginary.” 

“That may be so,” he agreed, “but they're also very useful at times. For instance, if you didn't have 
any money at all, but you happened to be with four other people who had ten dollars apiece, then you'd 
each have an average of eight dollars. Isn't that right?” 

“I guess so,” said Milo weakly. 

“Well, think how much better off you'd be, just because of averages,” he explained convincingly. 
“And think of the poor farmer when it doesn’t rain all year: if there wasn't an average yearly rainfall of 
37 inches in this part of the country, all his crops would wither and die.” 

It all sounded terribly confusing to Milo, for he had always had trouble in school with just this 
subject. 

“There are still other advantages,” continued the child. “For instance, if one rat were cornered by 
nine cats, then, on the average, each cat would be 10 per cent rat and the rat would be 90 per cent cat. 
If you happened to be a rat, you can see how much nicer it would make things.” 


–Norton Juster, The Phantom Tollbooth 


In all the sampling procedures discussed so far, we have assumed that the population 
is given and all we must do is reach in and take a suitable sample of units. But units 
are not necessarily nicely defined, even when the population is. There may be several 
ways of listing the units, and the unit size we choose may very well contain smaller 
subunits. 

Suppose we want to find out how many bicycles are owned by residents in a 
community of 10,000 households. We could take a simple random sample (SRS) of 
400 households, or we could divide the community into blocks of about 20 households 
each and sample every household (or subsample some of the households) in each of 
20 blocks selected at random from the 500 blocks in the community. The latter plan is 
an example of cluster sampling. The blocks are the primary sampling units (psus), 
or clusters. (In this chapter, we use the terms cluster and psu interchangeably.) The 
households are the secondary sampling units (ssus); often the ssus are the elements 
in the population. 

The cluster sample of 400 households is likely to give less precision than an SRS 
of 400 households; some blocks of the community are composed mainly of families 
(with more bicycles), while the residents of other blocks are mainly retirees (with 
fewer bicycles). Twenty households in the same block are not as likely to mirror the 
diversity of the community as well as 20 households chosen at random. Thus, cluster 
sampling in this situation will probably result in less information per observation 
than an SRS of the same size. However, if you conduct the survey in person, it is 
much cheaper and easier to interview all 20 households in a block than 20 households 
selected at random from the community, so cluster sampling may well result in more 
information per dollar spent. 

In cluster sampling, individual elements of the population are allowed in the 
sample only if they belong to a cluster (psu) that is included in the sample. The 
sampling unit (psu) is not the same as the observation unit (ssu), and the two sizes of 
experimental units must be considered when calculating standard errors from cluster 
samples. 

Why use cluster samples? 


1 Constructing a sampling frame list of observation units may be difficult, expensive, 
or impossible. We cannot list all honeybees in a region or all customers of a store; 
we may be able to construct a list of all trees in a stand of northern hardwood 
forest or a list of individuals in a city for which we only have a list of housing 
units, but constructing the list will be time consuming and expensive. 


2 The population may be widely distributed geographically or may occur in natural 
clusters such as households or schools, and it is less expensive to take a sample of 
clusters rather than an SRS of individuals. If the target population is residents of 
nursing homes in the United States, it is much cheaper to sample nursing homes 
and interview every resident in the selected homes than to interview an SRS of 
nursing home residents: With an SRS of residents, you might have to travel to a 
nursing home just to interview one resident. If taking an archaeological survey, you 
would examine all of the artifacts found in a region—you would not just choose 
points at random and examine only artifacts occurring at those isolated points. 


Clusters bear a superficial resemblance to strata: A cluster, like a stratum, is a 
grouping of the members of the population. The selection process, though, is quite 
different in the two methods. Similarities and differences between cluster samples 
and stratified samples are illustrated in Figure 5.1. 

Whereas stratification generally increases precision when compared with simple 
random sampling, cluster sampling generally decreases it. Members of the same 
cluster tend to be more similar than elements selected at random from the whole 
population—members of the same household tend to have similar political views; fish 
in the same lake tend to have similar concentrations of mercury; residents of the same 
nursing home tend to have similar opinions of the quality of care. These similarities 
usually arise because of some underlying factors that may or may not be measurable— 
residents of the same nursing home may have similar opinions because the care is 
poor, or the concentration of mercury in the fish may reflect the concentration of 
mercury in the lake. Thus, we do not obtain as much information about all nursing 
home residents in the United States by sampling two residents in the same home as 
by sampling two residents in different homes, because the two residents in the same 
home are likely to have more similar opinions. By sampling everyone in the cluster, 
we partially repeat the same information instead of obtaining new information, and 
that gives us less precision for estimates of population quantities. Cluster sampling is 
used in practice because it is usually much cheaper and more convenient to sample 
in clusters than randomly in the population. Most large household surveys carried 
out by the U.S. government, or by commercial or academic institutions, use cluster 
sampling because of the cost savings. 


FIGURE 5.1 
Similarities and differences between stratified sampling and one-stage cluster sampling 

Stratified Sampling 

  Each element of the population is in exactly one stratum. 

  Population of H strata; stratum h has $n_h$ elements. 

  Variance of the estimate of $\bar{y}_U$ depends on the variability of values within strata. 

  For greatest precision, individual elements within each stratum should have similar 
  values, but stratum means should differ from each other as much as possible. 

Cluster Sampling 

  Each element of the population is in exactly one cluster. 

  One-stage cluster sampling; population of N clusters. Take an SRS of clusters; 
  observe all elements within the clusters in the sample. 

  The cluster is the sampling unit; the more clusters we sample, the smaller the 
  variance. The variance of the estimate of $\bar{y}_U$ depends primarily on the variability 
  between cluster means. 

  For greatest precision, individual elements within each cluster should be 
  heterogeneous, and cluster means should be similar to one another. 


One of the biggest mistakes made by researchers using survey data is to analyze 
a cluster sample as if it were an SRS. Such confusion usually results in the 
researchers reporting standard errors that are much smaller than they should be; this 
gives the impression that the survey results are much more precise than they really are. 
Exercise 33 presents an activity for exploring what happens to properties of confidence 
intervals (CIs) when clustered data are analyzed incorrectly. 


EXAMPLE 5.1 

Basow and Silberg (1987) report results of their research on whether students evaluate 
female college professors differently than they evaluate male college professors. The 
authors matched 16 female professors with 16 male professors by subject taught, 
years of teaching experience, and tenure status, and gave evaluation questionnaires 
to students in those professors’ classes. The sample size for analyzing this study is 
n = 32, the number of faculty studied; it is not 1029, the number of students who 
returned questionnaires. Students’ evaluations of faculty reflect the different styles of 
faculty teaching; students within the same class are likely to have some agreement 
in their rating of the professor and should not be treated as independent observations 
because their ratings will probably be positively correlated. If this positive correlation 
is ignored and the student ratings treated as independent observations, differences will 
be declared statistically significant far more often than they should be. ■ 


After a brief journey into “notation land” in Section 5.1, we begin by discussing 
one-stage cluster sampling, in which every element within a sampled cluster is 
included in the sample. We then generalize the results to two-stage cluster sampling, 
in which we subsample only some of the elements of selected clusters, in Section 5.3. 
In Section 5.4, we discuss design issues for cluster sampling, including selection of 
subsample and sample sizes. In Section 5.5, we return to systematic sampling, which 
we previously discussed in Section 2.7, and show that it is a special case of cluster 
sampling. The chapter concludes with theory of cluster sampling from the model- 
based perspective; we shall derive the design-based theory in the more general setting 
of Section 6.6. 


Notation for Cluster Sampling 


In simple random sampling, the units sampled are also the elements observed. In 
cluster sampling, the sampling units are the clusters (psus) and the elements observed are 
the ssus within the clusters. The universe $\mathcal{U}$ is the population of N psus; $S$ designates 
the sample of psus chosen from the population of psus, and $S_i$ is the sample of ssus 
chosen from the ith psu. The measured quantities are 


$y_{ij}$ = measurement for jth element in ith psu, 


but in cluster sampling, it is easiest to think at the psu level in terms of cluster totals. 
No matter how you define it, the notation for cluster sampling is messy because you 
need notation for both the psu and the ssu levels. The notation used in this chapter 
and Chapter 6 is presented in this section for easy reference. Note that in Chapters 5 
and 6, N is the number of psus, not the number of observation units. 




psu Level: Population Quantities 

$N$ = number of psus in the population 

$M_i$ = number of ssus in psu i 

$M_0 = \sum_{i=1}^{N} M_i$ = total number of ssus in the population 

$t_i = \sum_{j=1}^{M_i} y_{ij}$ = total in psu i 

$t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$ = population total 

$S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2$ = population variance of the psu totals 


ssu Level: Population Quantities 

$\bar{y}_U = \frac{1}{M_0} \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$ = population mean 

$\bar{y}_{iU} = \frac{1}{M_i} \sum_{j=1}^{M_i} y_{ij} = \frac{t_i}{M_i}$ = population mean in psu i 

$S^2 = \frac{1}{M_0 - 1} \sum_{i=1}^{N} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_U)^2$ = population variance (per ssu) 

$S_i^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{iU})^2$ = population variance within psu i 


Sample Quantities 

$n$ = number of psus in the sample 

$m_i$ = number of ssus in the sample from psu i 

$\bar{y}_i = \frac{1}{m_i} \sum_{j \in S_i} y_{ij}$ = sample mean (per ssu) for psu i 

$\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i} y_{ij}$ = estimated total for psu i 

$\hat{t}_{unb} = \sum_{i \in S} \frac{N}{n} \hat{t}_i$ = unbiased estimator of population total 

$s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2$ 

$s_i^2 = \frac{1}{m_i - 1} \sum_{j \in S_i} (y_{ij} - \bar{y}_i)^2$ = sample variance within psu i 

$w_{ij}$ = sampling weight for ssu j in psu i 
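
To make the notation concrete, the following Python sketch evaluates several of these quantities for a small, made-up population. The population values, the sample sizes (n = 2 psus, m_i = 2 ssus per sampled psu), and the random seed are all hypothetical and chosen only for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the illustration is reproducible

# Hypothetical population: N = 4 psus of unequal sizes; entries are the y_ij.
population = [[3, 5, 4], [8, 6], [2, 2, 3, 1], [7, 9, 8]]

# psu-level population quantities
N = len(population)                    # number of psus
M = [len(psu) for psu in population]   # M_i, number of ssus in psu i
M0 = sum(M)                            # total number of ssus
t_i = [sum(psu) for psu in population] # psu totals
t = sum(t_i)                           # population total
S2_t = sum((ti - t / N) ** 2 for ti in t_i) / (N - 1)  # variance of psu totals

# ssu-level population quantities
y_bar_U = t / M0                       # population mean per ssu
S2 = sum((y - y_bar_U) ** 2 for psu in population for y in psu) / (M0 - 1)

# Sample quantities: an SRS of n = 2 psus, then an SRS of m_i = 2 ssus in each.
sampled_psus = random.sample(range(N), 2)
n = len(sampled_psus)
t_hat_i = {}
for i in sampled_psus:
    ssus = random.sample(population[i], 2)
    t_hat_i[i] = M[i] / len(ssus) * sum(ssus)   # estimated total for psu i
t_hat_unb = N / n * sum(t_hat_i.values())       # unbiased estimator of t

print("M0 =", M0, " t =", t, " S_t^2 =", round(S2_t, 3), " y_bar_U =", round(y_bar_U, 3))
print("sampled psus:", sampled_psus, " t_hat_unb =", t_hat_unb)
```

Note that N counts psus rather than observation units, which is why the population mean divides by M_0 and not by N.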


One-Stage Cluster Sampling 




In one-stage cluster sampling, either all or none of the elements that compose a 
cluster (= psu) are in the sample. One-stage cluster sampling is used in many surveys 
in which the cost of sampling ssus is negligible compared with the cost of sampling 
psus. For education surveys, a natural psu is the classroom; all students in a selected 
classroom are often included as the ssus since little extra cost is added by handing 
out a questionnaire to all students in the classroom rather than some. 

In the population of N psus, the ith psu contains M; ssus (elements). In the simplest 
design, we take an SRS of n psus from the population and measure our variable of 
interest on every element in the sampled psus. Thus, for one-stage cluster sampling, 
$M_i = m_i$. 


Clusters of Equal Sizes: Estimation 


Let’s consider the simplest case in which each psu has the same number of elements, 
with M; = m; = M. Most naturally occurring clusters of people do not fit into this 
framework, but it can occur in agricultural and industrial sampling. Estimating pop- 
ulation means or totals is simple: We treat the psu means or totals as the observations 
and simply ignore the individual elements. 

Thus, we have an SRS of n data points $\{t_i, i \in S\}$; $t_i$ is the total for all the 
elements in psu i. Then $\bar{t}_S = \sum_{i \in S} t_i/n$ estimates the average of the cluster totals. 
In a household survey to estimate income in two-person households, the individual 
observations $y_{ij}$ are the incomes of individual persons within the household, $t_i$ is the 
total income for household i ($t_i$ is known for sampled households because both persons 
are interviewed), $\bar{t}_U$ is the average income per household, and $\bar{y}_U$ is the average income 
per person. To estimate the total income t, we can use the estimator 


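$$\hat{t} = \frac{N}{n} \sum_{i \in S} t_i. \qquad (5.1)$$

The results in Sections 2.3 and 2.8 apply to $\hat{t}$ because we have an SRS of n units from 
a population of N units. As a result, $\hat{t}$ is an unbiased estimator of t, with variance 
given by

$$V(\hat{t}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} \qquad (5.2)$$

and with standard error

$$\mathrm{SE}(\hat{t}) = N \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}}, \qquad (5.3)$$

where $S_t^2$ and $s_t^2$ are the population and sample variance, respectively, of the psu totals:

$$S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2$$

and

$$s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(t_i - \frac{\hat{t}}{N}\right)^2.$$

To estimate the population mean $\bar{y}_U$, divide the estimated total by the number of 
ssus in the population:

$$\hat{\bar{y}} = \frac{\hat{t}}{NM}, \qquad (5.4)$$

with

$$V(\hat{\bar{y}}) = \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n M^2} \qquad (5.5)$$

and

$$\mathrm{SE}(\hat{\bar{y}}) = \frac{1}{M} \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}}. \qquad (5.6)$$

No new ideas are introduced to carry out one-stage cluster sampling; we simply 
use the results for simple random sampling with the psu totals as the observations. 


EXAMPLE 5.2 

A student wants to estimate the average grade point average (GPA) in his dormitory. 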
Instead of obtaining a listing of all students in the dorm and conducting an SRS, he 
notices that the dorm consists of 100 suites, each with four students; he chooses 5 of 
those suites at random, and asks every person in the 5 suites what her or his GPA is. 
The results are as follows: 


Person                  Suite (psu)
Number       1       2       3       4       5
1          3.08    2.36    2.00    3.00    2.68
2          2.60    3.04    2.56    2.88    1.92
3          3.44    3.28    2.52    3.44    3.28
4          3.04    2.68    1.88    3.64    3.20
Total     12.16   11.36    8.96   12.96   11.08

The psus are the suites, so N = 100, n = 5, and M = 4. The estimate of the population 
total (the estimated sum of all the GPAs for everyone in the dorm, a meaningless 
quantity for this example but useful for demonstrating the procedure) is

$$\hat{t} = \frac{100}{5} (12.16 + 11.36 + 8.96 + 12.96 + 11.08) = 1130.4.$$

The average of the suite totals is estimated by $\bar{t} = 1130.4/100 = 11.304$, and

$$s_t^2 = \frac{1}{5-1} \left[(12.16 - 11.304)^2 + \cdots + (11.08 - 11.304)^2\right] = 2.256.$$

Note that $s_t^2$ is simply the usual sample variance of the 5 suite totals. Thus, using (5.4) 
and (5.6), $\hat{\bar{y}} = 1130.4/400 = 2.826$, and

$$\mathrm{SE}(\hat{\bar{y}}) = \frac{1}{4} \sqrt{\left(1 - \frac{5}{100}\right) \frac{2.256}{5}} = 0.164.$$

Note that in these calculations, only the “total” row of the data table is used; the 
individual GPAs are only used for their contribution to the suite total. ■ 
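
The arithmetic in Example 5.2 can be checked with a short Python sketch. The function name `one_stage_cluster_estimates` and its return format are illustrative choices, not part of the book or of any library; the function applies (5.1), (5.3), (5.4), and (5.6) to the sampled psu totals under the assumption that every psu has the same size M.

```python
import math


def one_stage_cluster_estimates(psu_totals, N, M):
    """One-stage cluster sample (SRS of psus), all psus of equal size M.

    psu_totals: totals t_i for the n sampled psus
    N: number of psus in the population
    M: number of ssus in each psu
    """
    n = len(psu_totals)
    t_hat = N / n * sum(psu_totals)                  # (5.1)
    t_bar = t_hat / N                                # mean of the sampled psu totals
    s2_t = sum((t - t_bar) ** 2 for t in psu_totals) / (n - 1)
    se_t = N * math.sqrt((1 - n / N) * s2_t / n)     # (5.3)
    y_hat = t_hat / (N * M)                          # (5.4)
    se_y = se_t / (N * M)                            # (5.6), same as sqrt(...)/M
    return {"t_hat": t_hat, "s2_t": s2_t, "se_t": se_t, "y_hat": y_hat, "se_y": se_y}


# Suite totals from the "Total" row of the data table in Example 5.2.
suite_totals = [12.16, 11.36, 8.96, 12.96, 11.08]
est = one_stage_cluster_estimates(suite_totals, N=100, M=4)

for name, value in est.items():
    print(f"{name:6s} = {value:.4f}")
```

Only the five suite totals are passed in, mirroring the observation that the individual GPAs enter the calculation solely through those totals.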


One-stage cluster sampling with an SRS of psus produces a self-weighting sample. 
The weight for each observation unit is

$$w_{ij} = \frac{1}{P\{\text{ssu } j \text{ of psu } i \text{ is in sample}\}} = \frac{N}{n}.$$

For the data in Example 5.2, then,

$$\hat{t} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij} = \frac{N}{n} (3.08 + 2.60 + \cdots + 3.28 + 3.20) = \frac{100}{5} (56.52) = 1130.4.$$

Thus, as in stratified sampling, we can estimate a population total by summing the 
product of the observed values and the sampling weights. The population mean is 
estimated by

$$\hat{\bar{y}} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}} = \frac{1130.4}{400} = 2.826.$$


SAS code for analyzing these data is given on the website. The output follows. 
The only indication from the output that the analysis uses the clustering is in the 
data summary line giving the number of clusters. The sum of the weights can some- 
times be used to diagnose problems in your weight calculations; if the sum of the 
weights is far from the number of observations, you may have calculated the weights 
incorrectly. 
                    Data Summary

        Number of Clusters              5
        Number of Observations         20
        Sum of Weights                400

                     Statistics

                               Std Error
    Variable    N       Mean     of Mean        95% CL for Mean
    gpa        20   2.826000    0.163665    2.37159339   3.28040661


If we had taken an SRS of nM elements, each element in the sample would have 
been assigned weight (NM)/(nM) = N/n, the same weights we obtain for cluster 
sampling. The precision obtained for the two types of sampling, however, can differ 
greatly; the difference in precision is explored in the next section. 
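
As a rough illustration of how much the two analyses can differ, the sketch below computes the standard error of the mean GPA for the Example 5.2 data twice: once correctly, from the suite totals as in (5.6), and once incorrectly, as if the 20 students had been an SRS from the 400 students in the dorm. The variable names are illustrative; the naive calculation is shown only to demonstrate the mistake of ignoring the clustering.

```python
import math
from statistics import variance  # sample variance with divisor n - 1

suites = [
    [3.08, 2.60, 3.44, 3.04],
    [2.36, 3.04, 3.28, 2.68],
    [2.00, 2.56, 2.52, 1.88],
    [3.00, 2.88, 3.44, 3.64],
    [2.68, 1.92, 3.28, 3.20],
]
N, n, M = 100, len(suites), 4

# Correct analysis: the psu totals are the observations, as in (5.6).
totals = [sum(s) for s in suites]
se_cluster = math.sqrt((1 - n / N) * variance(totals) / n) / M

# Incorrect analysis: pretend the nM students are an SRS from the NM students.
all_gpas = [y for s in suites for y in s]
se_naive = math.sqrt((1 - (n * M) / (N * M)) * variance(all_gpas) / (n * M))

print(f"SE using the cluster design: {se_cluster:.4f}")
print(f"SE pretending it is an SRS:  {se_naive:.4f}")
```

For these data the SRS formula gives a standard error of about 0.112, noticeably smaller than the 0.164 obtained from the cluster analysis, so a CI based on it would be too narrow.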
In this section we compare cluster sampling with simple random sampling: Cluster 
sampling almost always provides less precision for the estimators than one would 
obtain by taking an SRS with the same number of elements. 

As in stratified sampling, let’s look at the ANOVA table (Table 5.1) for the whole 
population. In stratified sampling, the variance of the estimator of t depended on 
the variability within the strata; Equation (3.3) and Table 3.3 imply that the variance 
in stratified sampling is small if SSW is small relative to SSTO, or equivalently, if 
the within mean square (MSW) is small relative to $S^2$. In stratified sampling, you
have some information about every stratum, so you need not worry about variability 
due to unsampled strata. If MSB/MSW is large—that is, the variability among the 
strata means is large when compared with the variability within strata—then stratified 
sampling increases precision. 

The opposite situation occurs in cluster sampling. In one-stage cluster sampling
when each psu has M ssus, the variability of the unbiased estimator of t depends
entirely on the between-psu part of the variability, because

$$S_t^2 = \frac{\sum_{i=1}^{N} (t_i - \bar{t}_U)^2}{N-1} = \frac{\sum_{i=1}^{N} M^2 (\bar{y}_{iU} - \bar{y}_U)^2}{N-1} = M(\text{MSB}).$$

Thus, for cluster sampling,

$$V(\hat{t}_{\text{cluster}}) = N^2 \left(1 - \frac{n}{N}\right) \frac{M(\text{MSB})}{n}. \tag{5.7}$$


If MSB/MSW is large in cluster sampling, then cluster sampling decreases preci- 
sion. In that situation, MSB is relatively large because it measures the cluster-to-cluster 
variability: Elements in different clusters often vary more than elements in the same 
cluster because different clusters have different means. If we took a cluster sample 
of classes and sampled all students within the selected classes, we would likely find 




TABLE 5.1
Population ANOVA Table: Cluster Sampling

    Source                       df          Sum of Squares                                                              Mean Square
    Between psus                 N - 1       SSB = $\sum_{i=1}^{N} \sum_{j=1}^{M} (\bar{y}_{iU} - \bar{y}_U)^2$          MSB
    Within psus                  N(M - 1)    SSW = $\sum_{i=1}^{N} \sum_{j=1}^{M} (y_{ij} - \bar{y}_{iU})^2$             MSW
    Total, about $\bar{y}_U$     NM - 1      SSTO = $\sum_{i=1}^{N} \sum_{j=1}^{M} (y_{ij} - \bar{y}_U)^2$               $S^2$


that average reading scores varied from class to class. An excellent reading teacher 
might raise the reading scores for the entire class; a class of students from an area with 
much poverty might tend to be undernourished and not score as highly at reading. 
Unmeasured factors, such as teaching skill or poverty, can affect the overall mean for 
a cluster, and thus cause MSB to be large. 

Within a class, too, students’ reading scores vary. The MSW is the pooled value 
of the within-cluster variances: the variance from element to element, present for 
all elements of the population. If the clusters are relatively homogeneous—if, for 
example, students in the same class have similar scores—the MSW will be small. 

Now let’s compare cluster sampling to simple random sampling. If, instead of
taking a cluster sample of M elements in each of n clusters, we had taken an SRS
with nM observations, the variance of the estimated total would have been

$$V(\hat{t}_{\text{SRS}}) = (NM)^2 \left(1 - \frac{nM}{NM}\right) \frac{S^2}{nM} = N^2 \left(1 - \frac{n}{N}\right) \frac{M S^2}{n}.$$

Comparing this with (5.7), we see that if MSB > $S^2$, then cluster sampling is less
efficient than simple random sampling.

The intraclass (sometimes called intracluster) correlation coefficient (ICC)
tells us how similar elements in the same cluster are. It provides a measure of
homogeneity within the clusters. ICC is defined to be the Pearson correlation coefficient
for the NM(M - 1) pairs $(y_{ij}, y_{ik})$ for i between 1 and N and $j \ne k$ (see Exercise 22)
and can be written in terms of the population ANOVA table quantities as

$$\text{ICC} = 1 - \frac{M}{M-1}\,\frac{\text{SSW}}{\text{SSTO}}. \tag{5.8}$$

Because $0 \le \text{SSW}/\text{SSTO} \le 1$, it follows from (5.8) that

$$-\frac{1}{M-1} \le \text{ICC} \le 1.$$
If the clusters are perfectly homogeneous and hence SSW = 0, then ICC = 1. Equation
(5.8) also implies that

$$\text{MSB} = \frac{NM-1}{M(N-1)}\, S^2\, [1 + (M-1)\,\text{ICC}]. \tag{5.9}$$

How much precision do we lose by taking a cluster sample? From (5.7) and (5.9),

$$\frac{V(\hat{t}_{\text{cluster}})}{V(\hat{t}_{\text{SRS}})} = \frac{\text{MSB}}{S^2} = \frac{NM-1}{M(N-1)}\,[1 + (M-1)\,\text{ICC}]. \tag{5.10}$$


If N, the number of psus in the population, is large so that $NM - 1 \approx M(N - 1)$, then
the ratio of the variances in (5.10) is approximately $1 + (M - 1)\text{ICC}$. So $1 + (M - 1)\text{ICC}$
ssus, taken in a one-stage cluster sample, give us approximately the same amount of
information as one ssu from an SRS. If ICC = 1/2 and M = 5, then $1 + (M - 1)\text{ICC} = 3$,
and we would need to measure 300 elements using a cluster sample to obtain the same
precision as an SRS of 100 elements. We hope, though, that because it is often much
cheaper and easier to collect data in a cluster sample, we will have more precision
per dollar spent in cluster sampling.
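
The variance ratio in (5.10) is easy to tabulate. The following Python sketch (the function name `design_effect` is our own label, not notation from this chapter) evaluates the exact ratio when N is supplied and the large-N approximation otherwise.

```python
def design_effect(M, icc, N=None):
    """Ratio V(cluster)/V(SRS) from (5.10). When N is not supplied, the
    large-N approximation 1 + (M - 1) * ICC is returned."""
    approx = 1 + (M - 1) * icc
    if N is None:
        return approx
    return (N * M - 1) / (M * (N - 1)) * approx


deff = design_effect(M=5, icc=0.5)
print(deff)         # variance inflation for ICC = 1/2 and M = 5
print(deff * 100)   # cluster-sample elements needed to match an SRS of 100
```

At the lower bound ICC = -1/(M - 1) the ratio is zero, which corresponds to the extreme case in which all psu totals are identical.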

The ICC provides a measure of homogeneity for the clusters. The ICC is positive 
if elements within a psu tend to be similar; then, SSW will be small relative to SSTO, 
and the ICC relatively large. When the ICC is positive, cluster sampling is less efficient 
than simple random sampling of elements. 

If the clusters occur naturally in the population, the ICC is usually positive. Ele- 
ments within the same cluster tend to be more similar than elements selected at random 
from the population. This may occur because the elements in a cluster share a similar 
environment—we would expect wells in the same geographic cluster to have similar 
levels of pesticides, or we would expect one area of a city to have a different incidence 
of measles than another area of a city. In human populations, personal choice as well 
as interactions among household members or neighbors may cause the ICC to be 
positive—wealthy households tend to live in similar neighborhoods, and persons in 
the same neighborhood may share similar opinions. 

The ICC is negative if elements within a cluster are dispersed more than a randomly
chosen group would be. This forces the cluster means to be very nearly equal— 
because SSTO = SSW + SSB, if SSTO is held fixed and SSW is large, then SSB must 
be small. If ICC < 0, cluster sampling is more efficient than simple random sampling 
of elements. The ICC is rarely negative in naturally occurring clusters, but negative 
values can occur in some systematic samples or artificial clusters, as discussed in 
Section 5.5. 

The ICC is only defined for clusters of equal sizes. An alternative measure of
homogeneity in general populations is the adjusted $R^2$, called $R_a^2$ and defined as

$$R_a^2 = 1 - \frac{\text{MSW}}{S^2}. \tag{5.11}$$

If all psus are of the same size, then the increase in variance due to cluster sampling is

$$\frac{V(\hat{t}_{\text{cluster}})}{V(\hat{t}_{\text{SRS}})} = \frac{\text{MSB}}{S^2} = 1 + \frac{N(M-1)}{N-1}\, R_a^2;$$

by comparing with (5.10), you can see that for many populations, $R_a^2$ is close to the ICC.
The quantity $R_a^2$ is a reasonable measure of homogeneity because of its interpretation
in linear regression: It is the relative amount of variability in the population explained
by the psu means, adjusted for the number of degrees of freedom. If the psus are
homogeneous, then the psu means are highly variable relative to the variation within
psus, and $R_a^2$ will be high.


EXAMPLE 5.3  Consider two artificial populations, each having three psus with three elements per
psu.

              Population A        Population B
    psu 1     10   20   30         9   10   11
    psu 2     11   20   32        17   20   20
    psu 3      9   17   31        31   32   30

The elements are the same in the two populations, so the populations share the
values $\bar{y}_U = 20$ and $S^2 = 84.5$. In population A, the psu means are similar and most
of the variability occurs within psus; in population B, most of the variability occurs
between psus.


                  Population A                  Population B
              $\bar{y}_{iU}$    $S_i^2$      $\bar{y}_{iU}$    $S_i^2$
    psu 1          20            100              10             1
    psu 2          21            111              19             3
    psu 3          19            124              31             1

ANOVA Table for Population A:

    Source                 df      SS        MS
    Between psus            2       6         3
    Within psus             6     670       111.67
    Total, about mean       8     676        84.5

ANOVA Table for Population B:

    Source                 df      SS        MS
    Between psus            2     666       333
    Within psus             6      10         1.67
    Total, about mean       8     676        84.5

For population A:

$$R_a^2 = 1 - \frac{111.67}{84.5} = -0.3215, \qquad \text{ICC} = 1 - \left(\frac{3}{2}\right)\frac{670}{676} = -0.4867.$$

For population B:

$$R_a^2 = 1 - \frac{1.67}{84.5} = 0.9803, \qquad \text{ICC} = 1 - \left(\frac{3}{2}\right)\frac{10}{676} = 0.9778.$$

Population A has much variation among elements within the psus, but little variation
among the psu means. This is reflected in the large negative values of the ICC and
$R_a^2$: Elements in the same cluster are actually less similar than randomly selected
elements from the whole population. For this situation, cluster sampling is more efficient
than simple random sampling.


The opposite situation occurs in population B: Most of the variability occurs
between psus, and the psus themselves are relatively homogeneous. The ICC and
$R_a^2$ are very close to 1, indicating that little new information would be gleaned by
sampling more than one element per psu. Here, one-stage cluster sampling is much
less efficient than simple random sampling. ■


Most real-life populations fall somewhere between these two extremes. The ICC
is usually positive, but not overly close to 1. Thus, there is a penalty in efficiency for
using cluster sampling, and that decreased efficiency should be offset by cost savings.


EXAMPLE 5.4  When all psus are the same size, we can estimate the variance of $\hat{t}$ as well as the ICC
from the sample ANOVA table. Here is the sample ANOVA table for the GPA data
from Example 5.2:


Source df SS MS 
Between suites 4 2.2557 0.56392 
Within suites 15 2.7756 0.18504 
Total 19 5.0313 0.26480 


In one-stage cluster sampling with equal psu sizes, the mean squares for within
suites and between suites are unbiased estimators of the corresponding quantities in
the population ANOVA table (see Exercise 25). Thus,

$$E[\widehat{\text{MSB}}] = \text{MSB} = \frac{S_t^2}{M}$$

and, using (5.7),

$$\text{SE}(\hat{\bar{y}}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{\widehat{\text{MSB}}}{nM}} = \sqrt{\left(1 - \frac{5}{100}\right) \frac{0.56392}{(5)(4)}} = 0.164,$$

as calculated in Example 5.2.

The sample mean square total is biased for estimating $S^2$, though (see Exercise 26).
Note that we can estimate the sums of squares from the population ANOVA table by
$\widehat{\text{SSB}} = (N - 1)\widehat{\text{MSB}}$ and $\widehat{\text{SSW}} = N(M - 1)\widehat{\text{MSW}}$, so an unbiased estimator of $S^2$ is

$$\hat{S}^2 = \frac{(N - 1)\widehat{\text{MSB}} + N(M - 1)\widehat{\text{MSW}}}{NM - 1}.$$

For the GPA data, $\widehat{\text{SSB}} = (99)(0.56392) = 55.828$ and $\widehat{\text{SSW}} = (300)(0.18504) = 55.512$.
Consequently, $\widehat{\text{SSTO}} = 55.828 + 55.512 = 111.340$. The estimates of the
population sums of squares are given in the following table:


                       df     SS (estimated)       MS
    Between suites     99         55.828         0.56392
    Within suites     300         55.512         0.18504
    Total             399        111.340         0.279
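
The entries of this table, and the homogeneity measures that can be derived from them, follow directly from the sample mean squares. Here is a minimal Python sketch, assuming N = 100 suites of M = 4 students and the sample mean squares reported above; the function name and dictionary keys are our own.

```python
def estimates_from_sample_anova(N, M, msb_hat, msw_hat):
    """Estimate population ANOVA quantities from the sample mean squares of a
    one-stage cluster sample with equal psu sizes."""
    ssb_hat = (N - 1) * msb_hat
    ssw_hat = N * (M - 1) * msw_hat
    ssto_hat = ssb_hat + ssw_hat
    s2_hat = ssto_hat / (N * M - 1)
    return {
        "SSB": ssb_hat,
        "SSW": ssw_hat,
        "SSTO": ssto_hat,
        "S2": s2_hat,
        "ICC": 1 - M / (M - 1) * ssw_hat / ssto_hat,   # sample analogue of (5.8)
        "R2a": 1 - msw_hat / s2_hat,                   # sample analogue of (5.11)
        "variance_ratio": msb_hat / s2_hat,            # estimated MSB / S^2
    }


gpa = estimates_from_sample_anova(N=100, M=4, msb_hat=0.56392, msw_hat=0.18504)
for key, value in gpa.items():
    print(key, round(value, 3))
```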


Using these estimates, $\hat{S}^2 = 111.340/399 = 0.279$ (note the difference between this
estimate and the one from the sample ANOVA table, 0.265). In addition,

$$\widehat{\text{ICC}} = 1 - \frac{M}{M - 1}\, \frac{\widehat{\text{SSW}}}{\widehat{\text{SSB}} + \widehat{\text{SSW}}} = 1 - \left(\frac{4}{3}\right) \frac{55.512}{111.34} = 0.335$$

and

$$\hat{R}_a^2 = 1 - \frac{\widehat{\text{MSW}}}{\hat{S}^2} = 1 - \frac{0.18504}{0.279} = 0.337.$$

The increase in variance for using cluster sampling is estimated to be

$$\frac{\widehat{\text{MSB}}}{\hat{S}^2} = \frac{0.56392}{0.279} = 2.02.$$

This says that we need to sample about 2.02n elements in a cluster sample to get the
same precision as an SRS of size n. There are 4 persons in each psu, so in terms of
precision, one psu is worth about 4/2.02 = 1.98 SRS persons. ■


EXAMPLE 5.5  When is a cluster not a cluster? When it’s the whole population.

Consider the situation of sampling oak trees on Santa Cruz Island, described in 
Example 4.5. There, the sampling unit was one tree, and an observation unit was a 
seedling by the tree. The population of interest was seedlings of oak trees on Santa 
Cruz Island. An SRS of trees was used to estimate quantities of interest about the 
population of oak trees on the island. 

But suppose the investigator had been interested in seedling survival in all of 
California, had divided the regions with oak trees into equal-sized areas, and had 
randomly selected five of those areas to be in the study. Then the primary sampling 
unit is the area, and trees are subsampled in each area. If Santa Cruz Island had been 
selected as one of the five areas, we could no longer treat the ten trees on Santa Cruz 
Island as though they were part of a random sample of trees from the population; 
instead, those trees are part of the Santa Cruz Island cluster. We would expect all ten 
trees on Santa Cruz Island to experience, as a group, different environmental factors 
(such as weather conditions and numbers of seedling eaters) than the ten trees selected 
in the Santa Ynez Valley on the mainland. Thus the ICC within each cluster (area) 
would likely be positive. 

However, suppose we were only interested in the seedlings from Tree #10 on 
Santa Cruz Island. Then the population is all seedlings from Tree #10, and the primary 
sampling unit is the seedling. In this situation, then, the tree is not a cluster but is the 
entire population. ■


5.2.3 Clusters of Unequal Sizes


Clusters are rarely of equal sizes in social surveys. In one of the early probability 
samples (Converse, 1987), the Enumerative Check Census of 1937, a 2% sample 
of postal routes was chosen, and questionnaires were distributed to all households 
on each chosen postal route with the goal of checking unemployment figures. Since 
postal routes had different numbers of households, the cluster sizes could vary greatly. 
In a one-stage cluster sample of n of the N psus, we know how to estimate
population totals and means in two ways: using unbiased estimation and using ratio
estimation.


Unbiased Estimation. An unbiased estimator of t is calculated exactly as in (5.1):

$$\hat{t}_{\text{unb}} = \frac{N}{n} \sum_{i \in S} t_i \tag{5.12}$$

and, by (5.3),

$$\text{SE}(\hat{t}_{\text{unb}}) = N \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}}. \tag{5.13}$$


The difference between unequal- and equal-sized clusters is that the variation among
the individual cluster totals $t_i$ is likely to be large when the clusters have different
sizes. The investigators conducting the Enumerative Check Census of 1937 were
interested in the total number of unemployed persons, and $t_i$ would be the number of
unemployed persons in postal route i. One would expect to find more persons, and
hence more unemployed persons, on a postal route with a large number of households
than on a postal route with a small number of households. So we would expect that
$t_i$ would be large when the psu size $M_i$ is large, and small when $M_i$ is small. Often,
then, $s_t^2$ is larger in a cluster sample when the psus have unequal sizes than when the
psus all have the same number of ssus.

The probability that a psu is in the sample is n/N, as an SRS of n of the N psus
is taken. Since one-stage cluster sampling is used, an ssu is included in the sample
whenever its psu is included in the sample. Thus, as in Section 5.2.1,

$$w_{ij} = \frac{1}{P\{\text{ssu } j \text{ of psu } i \text{ is in sample}\}} = \frac{N}{n}.$$

One-stage cluster sampling produces a self-weighting sample when the psus are
selected with equal probabilities. Using the weights, (5.12) may be written as

$$\hat{t}_{\text{unb}} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}. \tag{5.14}$$

We can use (5.12) and (5.13) to derive an unbiased estimator for $\bar{y}_U$ and to find
its standard error. Define

$$M_0 = \sum_{i=1}^{N} M_i$$

as the total number of ssus in the population; then $\hat{\bar{y}}_{\text{unb}} = \hat{t}_{\text{unb}}/M_0$ and
$\text{SE}(\hat{\bar{y}}_{\text{unb}}) = \text{SE}(\hat{t}_{\text{unb}})/M_0$. The unbiased estimator of the mean $\hat{\bar{y}}_{\text{unb}}$ can be inefficient when the
values of $M_i$ are unequal since it, like $\hat{t}_{\text{unb}}$, depends on the variability of the cluster
totals $t_i$. It also requires that $M_0$ be known; however, we often know $M_i$ only for
the sampled clusters. In the Enumerative Check Census, for example, the number of
households on a postal route would only be ascertained for the postal routes actually
chosen to be in the sample. We now examine another estimator for $\bar{y}_U$ that is usually
more efficient when the population psu sizes are unequal.
Ratio Estimation. We usually expect $t_i$ to be positively correlated with $M_i$. If psus are
counties, we would expect the total number of households living in poverty in county
i ($t_i$) to be roughly proportional to the total number of households in county i ($M_i$).
The population mean $\bar{y}_U$ is a ratio

$$\bar{y}_U = \frac{\sum_{i=1}^{N} t_i}{\sum_{i=1}^{N} M_i} = \frac{t}{M_0},$$

where $t_i$ and $M_i$ are usually positively correlated. Thus, $\bar{y}_U = B$ as in Section 4.1
(substituting $t_i$ for $y_i$ and using $M_i$ as the auxiliary variable $x_i$). Define

$$\hat{\bar{y}}_r = \frac{\hat{t}_{\text{unb}}}{\hat{M}_0} = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i} = \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}. \tag{5.15}$$

Note that $\hat{\bar{y}}_r$ from (5.15) may also be calculated using the weights $w_{ij}$, as

$$\hat{\bar{y}}_r = \frac{\hat{t}_{\text{unb}}}{\hat{M}_0} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}. \tag{5.16}$$

Since an SRS of clusters is selected, all the weights are the same with $w_{ij} = N/n$.

The estimator $\hat{\bar{y}}_r$ in (5.15) is the quantity $\hat{B}$ in (4.2): the denominator is a random
quantity that depends on which particular psus are included in the sample. If the $M_i$’s
are unequal and a different cluster sample of size n is taken, the denominator will
likely be different. From (4.10),

$$\text{SE}(\hat{\bar{y}}_r) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{n \bar{M}^2}\, \frac{\sum_{i \in S} (t_i - \hat{\bar{y}}_r M_i)^2}{n - 1}} = \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{n \bar{M}^2}\, \frac{\sum_{i \in S} M_i^2 (\bar{y}_i - \hat{\bar{y}}_r)^2}{n - 1}}. \tag{5.17}$$

The variance of the ratio estimator depends on the variability of the means per element
in the clusters, and can be much smaller than that of the unbiased estimator $\hat{\bar{y}}_{\text{unb}}$.

If the total number of elements in the population, $M_0 = \sum_{i=1}^{N} M_i$, is known, we
can also use ratio estimation to estimate the population total: the ratio estimator is
$\hat{t}_r = M_0 \hat{\bar{y}}_r$ with $\text{SE}(\hat{t}_r) = M_0\, \text{SE}(\hat{\bar{y}}_r)$. Note, though, that $\hat{t}_r$ requires that we know
the total number of elements in the population, $M_0$; the unbiased estimator in (5.12)
makes no such requirement.
EXAMPLE 5.6  One-stage cluster samples are often used in educational studies, since students are
naturally clustered into classrooms or schools. Consider a population of 187 high
school algebra classes in a city. An investigator takes an SRS of 12 of those classes
and gives each student in the sampled classes a test about function knowledge. The
(hypothetical) data are given in the file algebra.dat, with the following summary
statistics.

    Class Number    $M_i$    $\bar{y}_i$    $t_i$     $M_i^2(\bar{y}_i - \hat{\bar{y}}_r)^2$
    23               20       61.5         1,230          456.7298
    37               26       64.2         1,670        1,867.7428
    38               24       58.4         1,402        9,929.2225
    39               34       58.0         1,972       24,127.7518
    41               26       58.0         1,508       14,109.3082
    44               28       64.9         1,816        4,106.2808
    46               19       55.2         1,048       19,825.3937
    51               32       72.1         2,308       93,517.3218
    58               17       58.2           989        5,574.9446
    62               21       66.6         1,398        7,066.1174
    106              26       62.3         1,621           33.4386
    108              26       67.2         1,746       14,212.7867
    Total           299                   18,708      194,827.0387


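We can use either (5.15) or (5.16) to estimate the mean score in the population: Using
(5.15),

$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i} = \frac{18{,}708}{299} = 62.57.$$

The standard error, from (5.17), is

$$\text{SE}(\hat{\bar{y}}_r) = \sqrt{\left(1 - \frac{12}{187}\right) \frac{1}{(12)(24.92^2)}\, \frac{194{,}827}{11}} = 1.49.$$

The weight for each observation is $w_{ij} = 187/12 = 15.5833$; we can alternatively
calculate $\hat{\bar{y}}_r$ using (5.16) as

$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \sum_{j=1}^{M_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j=1}^{M_i} w_{ij}} = \frac{291{,}533}{4659.41667} = 62.57.$$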


SAS software uses (5.16) to estimate $\bar{y}_U$. SAS code for calculating these estimates
and producing the following output is given on the website.

Data Summary

    Number of Clusters             12
    Number of Observations        299
    Sum of Weights         4659.41667

Statistics

                                        Std Error
    Variable     N     DF        Mean     of Mean
    score      299     11   62.568562    1.491578



The sum of the weights for the sample, 4659.41667, estimates the total number of
students in the 187 high school algebra classes. ■
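
The calculations in Example 5.6 can be reproduced from the class sizes and class totals in the summary table. The Python sketch below is not the SAS code from the website; the variable names are ours, and it uses the sample average class size for $\bar{M}$, as in the worked standard error above. It also computes the unbiased estimator of the total from (5.12) and (5.13) for comparison.

```python
import math

# Class sizes M_i and class totals t_i from the summary table (n = 12 classes).
M = [20, 26, 24, 34, 26, 28, 19, 32, 17, 21, 26, 26]
t = [1230, 1670, 1402, 1972, 1508, 1816, 1048, 2308, 989, 1398, 1621, 1746]
N, n = 187, len(M)

# Ratio estimator of the mean (5.15) and its standard error (5.17).
y_r = sum(t) / sum(M)
M_bar = sum(M) / n
ss_resid = sum((ti - y_r * Mi) ** 2 for ti, Mi in zip(t, M))
se_r = math.sqrt((1 - n / N) * ss_resid / (n - 1) / (n * M_bar ** 2))

# Unbiased estimator of the total (5.12) and its standard error (5.13).
t_unb = N / n * sum(t)
t_bar = sum(t) / n
s_t2 = sum((ti - t_bar) ** 2 for ti in t) / (n - 1)
se_unb = N * math.sqrt((1 - n / N) * s_t2 / n)

# Sum of the weights N/n over all sampled students.
sum_w = N / n * sum(M)

print(y_r, se_r)
print(t_unb, se_unb)
print(sum_w)
```

The ratio estimator needs only the sampled class sizes, whereas converting `t_unb` into an unbiased estimate of the mean would require the population count $M_0$, which is not given in the example.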


5.3 Two-Stage Cluster Sampling


In one-stage cluster sampling, we observe all the ssus within the selected psus. In 
many situations, though, the elements in a cluster may be so similar that measuring all 
subunits within a psu wastes resources; alternatively, it may be expensive to measure 
ssus relative to the cost of sampling psus. In these situations, it may be much cheaper 
to take a subsample within each psu selected. The stages within a two-stage cluster 
sample, when we sample the psus and subsample the ssus with equal probabilities, 
are: 


1 Select an SRS S of n psus from the population of N psus. 


2 Select an SRS of ssus from each selected psu. The SRS of $m_i$ elements from the
ith psu is denoted $S_i$.


The difference between one-stage and two-stage cluster sampling is illustrated in 
Figure 5.2. The extra stage complicates the notation and estimators, as we need to 
consider variability arising from both stages of data collection. The point estimators 
of t and $\bar{y}_U$ are analogous to those in one-stage cluster sampling, but the variance
formulas become messier. 

In one-stage cluster sampling, we could estimate the population total by
$\hat{t}_{\text{unb}} = (N/n) \sum_{i \in S} t_i$; the psu totals $t_i$ were known because we sampled every ssu in the
selected psus. In two-stage cluster sampling, however, since we do not observe every
ssu in the sampled psus, we need to estimate the individual psu totals by

$$\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i}\, y_{ij} = M_i \bar{y}_i$$


FIGURE 5.2

The difference between one-stage and two-stage cluster sampling.

[Figure not reproduced. Both panels start from a population of N psus and take an SRS of n psus. One-Stage: sample all ssus in the sampled psus. Two-Stage: take an SRS of $m_i$ ssus in sampled psu i.]



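and an unbiased estimator of the population total is

$$\hat{t}_{\text{unb}} = \frac{N}{n} \sum_{i \in S} \hat{t}_i = \frac{N}{n} \sum_{i \in S} M_i \bar{y}_i = \sum_{i \in S} \sum_{j \in S_i} \frac{N}{n}\, \frac{M_i}{m_i}\, y_{ij}. \tag{5.18}$$
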
For estimating means and totals in cluster samples, most survey statisticians use
sampling weights. Equation (5.18) suggests that the sampling weight for ssu j of psu
i is $\dfrac{N M_i}{n m_i}$, and we can see that this is so by calculating the inclusion probability. For
cluster sampling,

$$P(j\text{th ssu in } i\text{th psu is selected}) = P(i\text{th psu selected}) \times P(j\text{th ssu selected} \mid i\text{th psu selected}) = \frac{n}{N} \cdot \frac{m_i}{M_i}.$$

Recall from Section 2.4 that the weight of an element is the reciprocal of the probability
of its selection. Thus,

$$w_{ij} = \frac{N M_i}{n m_i}. \tag{5.19}$$


If psus are blocks, for example, and ssus are households, then household j in psu
i represents $(N M_i)/(n m_i)$ households in the population: itself, and $(N M_i)/(n m_i) - 1$
households that are not sampled. Then,

$$\hat{t}_{\text{unb}} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}. \tag{5.20}$$

In two-stage cluster sampling, a self-weighting design has each ssu representing
the same number of ssus in the population. To take a cluster sample of persons in
Illinois, we could take an SRS of counties in Illinois and then take an SRS of $m_i$ of the
$M_i$ persons from county i in the sample. To have every person in the sample represent
the same number of persons in the population, $m_i$ needs to be proportional to $M_i$, so
that $m_i/M_i$ is approximately constant. Thus, we would subsample more persons in
the large counties than in the small counties to have a self-weighting sample.
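
To see how the subsample sizes determine the weights in (5.19), consider the following Python sketch. The psu sizes and subsample sizes are hypothetical values chosen only for illustration, and the function name is ours.

```python
def two_stage_weights(N, n, M, m):
    """Sampling weight N*M_i/(n*m_i) from (5.19), one value per sampled psu;
    every ssu subsampled from psu i carries the same weight."""
    return [N * Mi / (n * mi) for Mi, mi in zip(M, m)]


# Hypothetical design: N = 100 psus in the population, n = 3 sampled psus.
N = 100
M = [200, 400, 1000]        # sizes of the three sampled psus (made up)
equal_m = [20, 20, 20]      # same subsample size in every psu
prop_m = [10, 20, 50]       # subsample size proportional to psu size (5%)

weights_equal = two_stage_weights(N, len(M), M, equal_m)
weights_prop = two_stage_weights(N, len(M), M, prop_m)
print(weights_equal)   # weights differ from psu to psu
print(weights_prop)    # constant weights: a self-weighting sample
```

With equal subsample sizes, an ssu from a large psu represents more population units than one from a small psu; making $m_i$ proportional to $M_i$ equalizes the weights.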

The sampling weights provide a convenient way of calculating point estimates; 
they do not avoid associated shortcomings such as large variances. Also, the sampling 
weights give no information on how to find standard errors: We need to derive the 
formula for the variance using the sampling design. 

In two-stage sampling, the $\hat{t}_i$’s are random variables. Consequently, the variance
of $\hat{t}_{\text{unb}}$ has two components: (1) the variability between psus and (2) the variability of
ssus within psus. We do not have to worry about component (2) in one-stage cluster
sampling.

The variance of $\hat{t}_{\text{unb}}$ in (5.18) equals the variance of $\hat{t}_{\text{unb}}$ from one-stage cluster
sampling plus an extra term to account for the extra variance due to estimating the
$t_i$’s rather than measuring them directly. For two-stage cluster sampling,

$$V(\hat{t}_{\text{unb}}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{S_i^2}{m_i}, \tag{5.21}$$

where $S_t^2$ is the population variance of the cluster totals, and $S_i^2$ is the population
variance among the elements within cluster i. The first term in (5.21) is the variance
from one-stage cluster sampling, and the second term is the additional variance due to
subsampling within the psus. If $m_i = M_i$ for each psu i, as occurs in one-stage cluster
sampling, then the second term in (5.21) is 0. To prove (5.21), we need to condition
on the units included in the sample. This is more easily done in the general setting of
unequal probability sampling; to avoid proving the same result twice, we shall prove
the general result in Section 6.6.¹

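To estimate $V(\hat{t}_{\text{unb}})$, let

$$s_t^2 = \frac{1}{n - 1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{t}_{\text{unb}}}{N}\right)^2 \tag{5.22}$$

be the sample variance among the estimated psu totals and let

$$s_i^2 = \frac{1}{m_i - 1} \sum_{j \in S_i} (y_{ij} - \bar{y}_i)^2 \tag{5.23}$$

be the sample variance of the ssus sampled in psu i. As will be shown in Section 6.6,
an unbiased estimator of the variance in (5.21) is given by

$$\hat{V}(\hat{t}_{\text{unb}}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} + \frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{s_i^2}{m_i}. \tag{5.24}$$

The standard error, $\text{SE}(\hat{t}_{\text{unb}})$, is of course the square root of (5.24).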
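
The estimators in (5.18) and (5.22) through (5.24) can be sketched as follows in Python. The data here are entirely hypothetical (three psus sampled from N = 10, with small subsamples), and the function name is our own; at least two ssus per sampled psu are needed so that $s_i^2$ is defined.

```python
import math
from statistics import mean, variance


def two_stage_unbiased(N, M, samples):
    """Unbiased estimate of the total (5.18) and the two terms of its
    variance estimator (5.24). M[i] is the size of sampled psu i and
    samples[i] holds the y values subsampled from it (at least two each)."""
    n = len(samples)
    t_hats = [Mi * mean(ys) for Mi, ys in zip(M, samples)]   # M_i * ybar_i
    t_unb = N / n * sum(t_hats)
    s_t2 = variance(t_hats)          # (5.22): deviations about t_unb / N
    between = N ** 2 * (1 - n / N) * s_t2 / n
    within = N / n * sum(
        (1 - len(ys) / Mi) * Mi ** 2 * variance(ys) / len(ys)   # uses (5.23)
        for Mi, ys in zip(M, samples)
    )
    return t_unb, between, within


# Hypothetical data: psu sizes 4, 6, 5 with subsamples of 2, 3, and 2 ssus.
t_unb, between, within = two_stage_unbiased(
    N=10, M=[4, 6, 5], samples=[[1, 3], [2, 2, 5], [4, 6]]
)
print(t_unb, between, within, math.sqrt(between + within))
```

Returning the two terms separately makes it easy to see how much of the estimated variance comes from subsampling within psus.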


Remark. In many situations when N is large, the contribution of the second term in
(5.24) to the variance estimator is negligible compared with that of the first term. We
show in Section 6.6 that

$$E[s_t^2] = S_t^2 + \frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{S_i^2}{m_i}.$$

We expect the sample variance of the estimated psu totals $\hat{t}_i$ to be larger than the
sample variance of the true psu totals $t_i$ because $\hat{t}_i$ will be different if we take a
different subsample in psu i. Thus, if N is large, the first term in (5.24) is approximately
unbiased for the theoretical variance in (5.21). To simplify calculations, most software
packages for analyzing survey data (including SAS software) estimate the variance
using only the first term of (5.24), often omitting the finite population correction (fpc),
$(1 - n/N)$. The estimator

$$\hat{V}_{\text{WR}}(\hat{t}_{\text{unb}}) = N^2\, \frac{s_t^2}{n} \tag{5.25}$$

estimates the with-replacement variance for a cluster sample, as will be seen in Section 6.3.
If the first-stage sampling fraction n/N is small, there is little difference
between the variance from a with-replacement sample and that from a without-replacement
sample. Alternatively, a replication method of variance estimation from
Chapter 9 can be used.


If we know the total number of elements in the population, $M_0$, we can estimate
the population mean by $\hat{\bar{y}}_{\text{unb}} = \hat{t}_{\text{unb}}/M_0$ with standard error $\text{SE}(\hat{\bar{y}}_{\text{unb}}) = \text{SE}(\hat{t}_{\text{unb}})/M_0$.


¹ Working with the additional level of abstraction will allow us to see the structure of the variance more
clearly, without floundering in the notation of the special case of equal probabilities discussed in this
chapter. If you prefer to see the proof before you use the variance results, read Section 6.6 now.
As in one-stage cluster sampling with unequal cluster sizes, $s_t^2$ can be very large
since it is affected both by variations in the unit sizes (the $M_i$) and by variations in the
$\bar{y}_i$. If the cluster sizes are disparate, this component is large, even if the cluster means
are fairly constant.


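Ratio Estimation. As in one-stage cluster sampling, we use a ratio estimator for the
population mean. Again, the y’s of Chapter 4 are the psu totals (now estimated by $\hat{t}_i$)
and the x’s are the psu sizes $M_i$. As in (5.15),

$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}. \tag{5.26}$$

Using the sampling weights in (5.19) with $w_{ij} = (N M_i)/(n m_i)$, we can rewrite $\hat{\bar{y}}_r$ as

$$\hat{\bar{y}}_r = \frac{\hat{t}_{\text{unb}}}{\hat{M}_0} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}. \tag{5.27}$$

The weights are different, but the form of the estimator is the same as in (5.16). The
variance estimator is again based on the approximation in (4.10):

$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}^2} \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n} + \frac{1}{n N \bar{M}^2} \sum_{i \in S} M_i^2 \left(1 - \frac{m_i}{M_i}\right) \frac{s_i^2}{m_i}, \tag{5.28}$$

where $s_i^2$ is defined in (5.23),

$$s_r^2 = \frac{1}{n - 1} \sum_{i \in S} (M_i \bar{y}_i - M_i \hat{\bar{y}}_r)^2, \tag{5.29}$$

and $\bar{M}$ is the average psu size. As with $\hat{t}_{\text{unb}}$, the second term in (5.28) is usually
negligible compared with the first term, and most survey software packages calculate
the variance using only the first term.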


EXAMPLE 5.7  The data in the file coots.dat come from Arnold’s (1991) work on egg size and volume
of American Coot eggs in Minnedosa, Manitoba. In this data set, we look at volumes 
of a subsample of eggs in clutches (nests of eggs) with at least two eggs available for 
measurement. 

The data are plotted in Figures 5.3 and 5.4. Data from a cluster sample can be 
plotted in many ways, and you often need to construct more than one type of plot to see 
features of the data. Because we have only two observations per clutch, we can plot 
the individual data points. If we had many observations per clutch, we could instead 
construct side-by-side boxplots, with one boxplot for each psu (we did a similar plot 
in Figure 3.1 for a stratified sample, constructing a boxplot for each stratum). We 
shall return to the issue of plotting data from complex surveys in Section 7.4. 

Next, we use a spreadsheet (partly shown in Table 5.2; the full spreadsheet is on 
the website) to calculate summary statistics for each clutch. The summary statistics 
may then be used to estimate the average egg volume and its variance. The numbers
have been rounded so that they fit on the page; in practice, of course, you should carry
out all calculations to machine precision.

FIGURE 5.3

Plot of egg volume data. Note the wide variation in the means from clutch to clutch. This
indicates that eggs within the same clutch tend to be more similar than two randomly selected
eggs from different clutches, and that clustering does not provide as much information per egg
as would an SRS of eggs.

FIGURE 5.4

Another plot of egg volume data. Here, we ordered the clutches from smallest mean to largest
mean, and drew the line connecting the two measurements of volume from the eggs in the
clutch. Clutch number 88, represented by the long line in the middle of the graph, has an
unusually large difference between the two eggs: One egg has volume 1.85, and the other has
volume 2.84.

TABLE 5.2
Part of Spreadsheet Used for Calculations in Example 5.7

    Clutch   $M_i$   $\bar{y}_i$   $s_i^2$      $\hat{t}_i$     $(1 - m_i/M_i)\, M_i^2 s_i^2/m_i$   $(\hat{t}_i - M_i \hat{\bar{y}}_r)^2$
    1          13     3.86       0.0094       50.23594         0.671901                  318.9232
    2          13     4.19       0.0009       54.52438         0.065615                  490.4832
    3           6     0.92       0.0005        5.49750         0.005777                   89.22633
    4          11     3.00       0.0008       32.98168         0.039354                   31.19576
    ...
    182        13     4.22       0.00003      54.85854         0.002625                  505.3962
    183        13     4.41       0.0088       57.39262         0.630563                  625.7549
    184        12     3.48       0.000006     41.81168         0.000400                  142.1994
    sum      1757                           4375.94652        42.174452               11,439.5794

    $\hat{\bar{y}}_r$ = 2.490579

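We use the ratio estimator to estimate the mean egg volume. From (5.26),

$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{4375.947}{1757} = 2.49.$$

From the spreadsheet (Table 5.2),

$$s_r^2 = \frac{1}{n - 1} \sum_{i \in S} (\hat{t}_i - M_i \hat{\bar{y}}_r)^2 = \frac{11{,}439.58}{183} = 62.511$$

and $\bar{M}_S = 1757/184 = 9.549$. Using (5.28), then,

$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{9.549^2} \left[\left(1 - \frac{184}{N}\right) \frac{62.511}{184} + \frac{1}{N}\, \frac{42.17}{184}\right].$$
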

Now N, the total number of clutches in the population, is unknown but presumed to
be large (and known to be larger than 184). Thus, we may take the psu-level fpc to be
1, and note that the second term in the estimated variance will be very small relative
to the first term. We then use

$$\text{SE}(\hat{\bar{y}}_r) = \frac{1}{9.549} \sqrt{\frac{62.511}{184}} = 0.061.$$

The estimated coefficient of variation for $\hat{\bar{y}}_r$ is

$$\frac{\text{SE}(\hat{\bar{y}}_r)}{\hat{\bar{y}}_r} = \frac{0.061}{2.49} = 0.0245.$$
TABLE 5.3
Part of Spreadsheet for Egg Volume Calculations Using Relative Weights

    clutch   csize    volume      relweight   volume x relweight
    1          13     3.795757      6.5         24.67242
    1          13     3.93285       6.5         25.56352
    2          13     4.215604      6.5         27.40142
    2          13     4.172762      6.5         27.12295
    3           6     0.931765      3.0          2.795294
    3           6     0.900736      3.0          2.702209
    ...
    183        13     4.481221      6.5         29.12794
    183        13     4.348412      6.5         28.26468
    184        12     3.486132      6.0         20.91679
    184        12     3.482482      6.0         20.89489
    sum      3514                  1757       4375.947



We used a spreadsheet to illustrate calculating the variance estimate with the
formulas. Survey software packages will calculate $\hat{\bar{y}}_r$ and its standard error for you.
For this example, SAS code provided on the website results in the values:

                                  Std Error
    Variable     N       Mean       of Mean         95% CL for Mean
    volume     368   2.490579      0.061040    2.37014533    2.61101179


SAS software calculates the mean using (5.27) with the weights. The weight for egg
j in clutch i is:

$$w_{ij} = \frac{N M_i}{n m_i} = \frac{N}{184}\, \frac{M_i}{2}.$$

Because N is unknown, we display the relative weights $M_i/2$ in a spreadsheet
(Table 5.3). Column 5 is set equal to $y_{ij}$ times the relative weight; using (5.27) and
the column sums in Table 5.3, $\hat{\bar{y}}_r = 4375.947/1757 = 2.49$. The weights do not allow
us to calculate the standard
error, however; we still need to use (5.28) for that. ■
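
The arithmetic of Example 5.7 can be checked from the column sums in Table 5.2 alone. The Python sketch below uses only those rounded summary values (not the full coots.dat file), and the variable and function names are ours; the function `v_hat` evaluates (5.28) for a hypothesized number of clutches N to show how little the second term matters when N is large.

```python
import math

# Column sums from Table 5.2 (rounded values as printed in the text).
n = 184                   # sampled clutches
sum_t_hat = 4375.947      # sum of estimated clutch totals
sum_M = 1757              # sum of clutch sizes
sum_sq_resid = 11439.58   # sum of (t_hat_i - M_i * y_r)^2
within_sum = 42.17        # sum of (1 - m_i/M_i) * M_i^2 * s_i^2 / m_i

y_r = sum_t_hat / sum_M             # ratio estimate of mean egg volume (5.26)
s_r2 = sum_sq_resid / (n - 1)       # (5.29)
M_bar = sum_M / n                   # average size of the sampled clutches
se = math.sqrt(s_r2 / n) / M_bar    # first term of (5.28) with the fpc set to 1
cv = se / y_r


def v_hat(N):
    """Full variance estimate (5.28) for a hypothesized population size N."""
    return ((1 - n / N) * s_r2 / n + within_sum / (n * N)) / M_bar ** 2


print(y_r, se, cv)
print(math.sqrt(v_hat(10_000)), math.sqrt(v_hat(1_000_000)))
```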


In Example 5.7, we could only use the ratio estimator because we know neither
N nor $M_0$. The $M_i$’s, however, did not vary widely, so the unbiased estimator would
probably have had a similar coefficient of variation. If all the $M_i$’s are equal, in fact,
the unbiased estimator is the same as the ratio estimator (see Exercise 25); if the $M_i$’s
vary, the unbiased estimator often performs poorly. The next example illustrates that
the unbiased estimator of t may have large variance when the cluster sizes are highly
variable.
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The case of the six-legged puppy. Suppose we want to estimate the average number 
of legs on the healthy puppies in Sample City puppy homes. Sample City has two 
puppy homes: Puppy Palace with 30 puppies, and Dog’s Life with 10 puppies. Let’s 
select one puppy home with probability 1/2. After the home is selected, then select 
2 puppies at random from the home, and use Junb to estimate the average number of 
legs per puppy. 

Suppose we select Puppy Palace. Not surprisingly, each of the two puppies sam- 
pled has four legs, so 7pp = 30 x 4 = 120. Then, using (5.18), an unbiased estimate 
for the total number of puppy legs in both homes is 


‘ 2s 
tunb = ye = 240. 


We divide the estimated total number of legs by the total number of puppies to estimate 
the mean number of legs per puppy as 240/40 = 6. 
If we select Dog’s Life instead, tp. = 10 x 4 = 40, and 
e 25 
tunb = 7 = 80. 
If Dog’s Life is selected, the unbiased estimate of the mean number of legs per puppy 
is 80/40 = 2. 

These are not good estimates of the number of legs per puppy. But the estimator is
mathematically unbiased: (6 + 2)/2 = 4, so averaging over all possible samples results
in the right number. The poor quality of the estimator is reflected in the very large
variance of the estimator, calculated using (5.21):

$$V(\hat{t}_{\text{unb}}) = \left(1 - \frac{1}{2}\right) 2^2\,\frac{S_t^2}{1} + \frac{2}{1} \sum_{i=1}^{2} \left(1 - \frac{m_i}{M_i}\right) M_i^2\,\frac{S_i^2}{m_i} = \frac{1}{2}(4)(3200) = 6400.$$

The ratio estimator, on the other hand, is right on target: If Puppy Palace is selected,
$\hat{\bar{y}}_r = 120/30 = 4$; if Dog’s Life is selected, $\hat{\bar{y}}_r = 40/10 = 4$. Because the estimate is
the same for all possible samples, $V(\hat{\bar{y}}_r) = 0$.
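
Because only two samples of puppy homes are possible, the whole sampling distribution can be listed. The following Python sketch (the variable names are illustrative choices, not from the text) enumerates both samples and computes the unbiased and ratio estimates, the expected value of the unbiased estimator, and its variance.

```python
homes = {"Puppy Palace": 30, "Dog's Life": 10}   # M_i for each puppy home
legs_per_puppy = 4                               # every sampled puppy has four legs
N, n = len(homes), 1                             # two psus, one is sampled
M0 = sum(homes.values())                         # 40 puppies in all

unb_totals, unb_means, ratio_means = [], [], []
for name, M_i in homes.items():
    t_hat_i = M_i * legs_per_puppy               # estimated total for the sampled home
    t_unb = (N / n) * t_hat_i                    # unbiased estimate of total legs
    unb_totals.append(t_unb)
    unb_means.append(t_unb / M0)                 # unbiased estimate of legs per puppy
    ratio_means.append(t_hat_i / M_i)            # ratio estimate of legs per puppy

# Each home is selected with probability 1/2
exp_total = sum(unb_totals) / N
var_total = sum((t - exp_total) ** 2 for t in unb_totals) / N

print(unb_means, ratio_means, exp_total / M0, var_total)
```

The unbiased estimates of the mean are 6 and 2, which average to 4, while the variance of the estimated total is 6400; the ratio estimate equals 4 for either sample.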


In general, the unbiased estimator of the population total is inefficient if the cluster
sizes are unequal and $t_i$ is roughly proportional to $M_i$. The variance of $\hat{t}_{\text{unb}}$ depends
on the variance of the $t_i$’s, and that variance may be large if the $M_i$’s are unequal.

The ratio estimator, however, generally performs well when $t_i$ is roughly proportional
to $M_i$. Recall from (4.7) that the approximate mean squared error (MSE)
of the estimator $\hat{B}$ is proportional to the variance of the residuals from the model:
Using the notation of this chapter, the approximate MSE of $\hat{\bar{y}}_r$ ($= \hat{B}$) is proportional
to $\sum_{i=1}^{N} (t_i - \bar{y}_U M_i)^2$. When $t_i$ (the response variable) is highly positively correlated
with $M_i$ (the auxiliary variable), the residuals are small. In Example 5.8, the total
number of puppy legs in a puppy home ($t_i$) is exactly four times the total number of
puppies in the home ($M_i$), so the variance of the ratio estimator is zero.

This is an important issue, since many naturally occurring clusters are of unequal
sizes, and we expect that the cluster totals will often be proportional to the number of
ssus. In a cluster sample of nursing homes, we expect that a larger number of residents
will be satisfied with the level of care in a home with 500 residents than in a home
with 20 residents, even though the proportions of residents who are satisfied may be
the same. The total of the math scores for all students in a class will be much greater
for large classes than for small classes. In general, we expect to see more honeybees
in a large area than a small area. For all of these situations, then, while the estimator $\hat{\bar{y}}_r$
works well, the estimator $\hat{t}_{\text{unb}}$ tends to have large variability. In Chapter 6, we discuss
an alternative design and estimator for cluster sampling that result in a much lower
variance for the estimated population total when $t_i$ is proportional to $M_i$.


5.4 Designing a Cluster Sample


Persons and organizations taking an expensive, large-scale survey need to devote a 
great deal of time to designing the survey; typically, large surveys administered by the 
U.S. Census Bureau take several years to design and test. Even then, the Fundamental 
Principle of Survey Design often holds true: You can best design the survey you should 
have taken after you have finished the survey. After the survey is completed, you can 
assess the effect of the clustering on the estimates, and know where you could have 
allocated more resources to obtain better information. 

The more you know about a population, the better you can design an efficient 
sampling scheme to study it. If you know the value of $y_i$ for every person in your
population, then you can design a flawless (but unnecessary because you already 
know everything!) survey for studying the population. If you know very little about 
the population, chances are that you will gain information about it after collecting the 
survey, but you may not have the most efficient design possible for that survey. You 
may, however, be able to use your newly gained knowledge to make your next survey 
more efficient. 

When designing a cluster sample, you need to decide four major issues:

1 What overall precision is needed?
2 What size should the psus be?
3 How many ssus should be sampled in each psu selected for the sample?
4 How many psus should be sampled?

Question 1 must be faced in any survey design. To answer questions 2 through 4, you
need to know the cost of sampling a psu for possible psu sizes, the cost of sampling
a ssu, and a measure of homogeneity ($R_a^2$ or ICC) for the possible sizes of psu.


5.4.1 Choosing the psu Size


The psu size is often a natural unit. In Example 5.7, a clutch of eggs was an obvious 
cluster unit. A survey to estimate calf mortality might use farms as the psus; a survey 
of sixth-grade students might use classes or schools as the psus. 

In other surveys, however, the investigator may have a wide choice for psu size.
In a survey to estimate the sex and age ratios of mule deer in a region of Colorado [see
Bowden et al. (1984) for more discussion of the problem], psus might be designated
areas and ssus might be individual deer or groups of deer in those areas. But should
the size of the psus be 1 km², 2 km², or 100 m²?

TABLE 5.4
Relative Net Precision in the Potato Beetle Study

Number of Stems                                              Cost to Sample   Relative
Sampled per Site   $\hat{\bar{y}}$   SE($\hat{\bar{y}}$)     One Field        Net Precision
1                  1.12              0.15                    31.67            0.24
2                  1.01              0.10                    33.33            0.30
3                  0.96              0.08                    35.00            0.34
4                  0.91              0.07                    36.67            0.35
5                  0.91              0.06                    38.33            0.40

A general principle in area surveys is that the larger the psu size, the more variability
you expect to see within a psu. Hence you expect $R_a^2$ and ICC to be smaller
with a large psu than with a small psu. If the psu size is too large, however, you may
lose the cost savings of cluster sampling.

Bellhouse (1984) reviews optimal designs for sampling, and the theory provides
useful guidance for designing your own survey. There are many ways to “try out”
different psu sizes before taking your survey. One way is to postulate a model for
the relationship between $R_a^2$ or MSW and M, and to fit the model using preliminary
data or information from other studies. Then use different combinations of $R_a^2$ and M,
and compare the costs. Another way is to perform an experiment and collect data on
relative costs and variances with different psu sizes.


The Colorado potato beetle has long been considered a major pest of potatoes. 
Zehnder et al. (1990) studied different sizes of sampling units that could be used 
to estimate potato beetle counts. Ten randomly selected sites were sampled from 
each of ten fields. The investigators visually inspected each site for small larvae, 
large larvae, and adults on all foliage from a single stem on each of five adjacent 
plants. 

They then considered different sizes of psu, ranging from one stem per site to five 
stems per site. To study the efficiency of a one-stem-per-site design, they examined 
data from stem 1 of each site. Similarly, the data from stems 1 and 2 of each site
gave a cluster sample with two ssus per psu, and so on. It takes about 30 minutes 
to walk among the sites in each field; sampling one stem requires about 10 seconds 
during the early part of the season. Thus the total cost to sample all ten sites with the 
one-stem-per-site design is estimated to be 30 + 100/60 = 31.67 minutes. Data for 
estimating the number of small larvae are given in Table 5.4. 

The relative net precision is calculated as 1/[(cost) CV($\hat{\bar{y}}$)]. For the one-stem-per-site
design, for instance, 1/[31.67 × (0.15/1.12)] = 0.24. For this example,
since the cost to sample additional stems at a site is small compared with the
time to traverse the field, the five-stem-per-site design is most efficient among those
studied.
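
The last two columns of Table 5.4 follow directly from the cost assumptions in the example (30 minutes of walking per field, 10 seconds per stem, ten sites per field) and from 1/[(cost) CV]. The Python sketch below recomputes them from the rounded means and standard errors printed in the table; the constant and function names are illustrative only.

```python
WALK_MINUTES = 30.0        # time to walk among the sites in one field
SECONDS_PER_STEM = 10.0    # time to inspect one stem early in the season
SITES_PER_FIELD = 10

# (stems per site, estimated mean, standard error) as printed in Table 5.4
table_5_4 = [(1, 1.12, 0.15), (2, 1.01, 0.10), (3, 0.96, 0.08),
             (4, 0.91, 0.07), (5, 0.91, 0.06)]

def relative_net_precision(stems, mean, se):
    """Return (cost in minutes for one field, 1 / (cost * CV))."""
    cost = WALK_MINUTES + SITES_PER_FIELD * stems * SECONDS_PER_STEM / 60.0
    cv = se / mean
    return cost, 1.0 / (cost * cv)

results = {stems: relative_net_precision(stems, mean, se)
           for stems, mean, se in table_5_4}

for stems, (cost, rnp) in results.items():
    print(f"{stems} stem(s): cost = {cost:.2f} min, relative net precision = {rnp:.2f}")
```

Changing `WALK_MINUTES` or `SECONDS_PER_STEM` shows how sensitive the ranking of designs is to the relative cost of travel and measurement.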
5.4.2 Choosing Subsampling Sizes


EXAMPLE 5.10 


The goal in designing a survey is generally to get the most information possible for the
least cost and inconvenience. In this section, we concentrate on designing a two-stage
cluster survey when all psus have the same number, M, of ssus; designing cluster
samples will be treated more generally in Chapters 6 and 7. One approach for
equal-sized clusters, discussed in Cochran (1977), is to minimize the variance in (5.21) for
a fixed cost. If $M_i = M$ and $m_i = m$ for all psus, then $V(\hat{\bar{y}}_{\text{unb}})$ may be rewritten (see
Exercise 24) as:

$$V(\hat{\bar{y}}_{\text{unb}}) = \left(1 - \frac{n}{N}\right) \frac{\text{MSB}}{nM} + \left(1 - \frac{m}{M}\right) \frac{\text{MSW}}{nm}, \tag{5.30}$$

where MSB and MSW are the between and within mean squares, respectively, in
Table 5.1, the population ANOVA table.

If MSW = 0 and hence $R_a^2 = 1$, for $R_a^2$ defined in (5.11), then each element within
a psu equals the psu mean. In that case you may as well take m = 1; examining
more than one element per psu just costs extra time and money without increasing
precision. For other values of $R_a^2$, the optimal allocation depends on the relative costs
of sampling psus and ssus.

Consider the simple cost function

$$\text{total cost} = C = c_1 n + c_2 n m, \tag{5.31}$$

where $c_1$ is the cost per psu (not including the cost of measuring ssus) and $c_2$ is the
cost of measuring each ssu. One can easily determine, using calculus, that the values

$$n_{\text{opt}} = \frac{C}{c_1 + c_2 m_{\text{opt}}}$$

and

$$m_{\text{opt}} = \sqrt{\frac{c_1 M (N - 1)(1 - R_a^2)}{c_2 (NM - 1) R_a^2}} \tag{5.32}$$

minimize the variance for fixed total cost C under this cost function (see Exercise 27);
often, though, a number of different values will work about equally well, and graphing
the projected variance of the estimator will give more information than merely computing
one fixed solution. A graphical approach also allows you to perform what-if
analyses on the designs: What if the costs or the cost function are slightly different?
Or the value of $R_a^2$ is changed slightly? You can also explore different cost functions
with this approach.

In (5.32), the value $R_a^2$ is from the population ANOVA table. In practice, we can
estimate it from pilot survey data by $\hat{R}_a^2 = 1 - \widehat{\text{MSW}}/\hat{S}^2$. In large populations, the ratio
$M(N - 1)/(NM - 1)$ will be close to 1, so we can use $m_{\text{opt}} = \sqrt{c_1 (1 - R_a^2)/(c_2 R_a^2)}$.


Would subsampling have been more efficient for Example 5.2 than the one-stage
cluster sample that was used? We do not know the population quantities, but have
information from the sample that can be used for planning future studies. Recall that
$\hat{S}^2 = 0.279$, and we estimated $R_a^2$ by 0.337. Figures 5.5 and 5.6 show the estimated
variance that would be achieved for different subsample sizes for different values of
$c_1$ and $c_2$, and for different values of $R_a^2$.

FIGURE 5.5
Estimated variance that would be obtained for the GPA example, for different values of $c_1$ and
$c_2$ and different values of m. The sample estimate of 0.337 was used for $R_a^2$. The total cost used
for this graph was C = 300. If it takes 40 minutes per suite and 5 minutes per person, then
one-stage cluster sampling should be used; if it takes 10 minutes per suite and 20 minutes per
person, then only one person should be sampled per suite; if it takes 20 minutes per suite and
10 minutes per person, the minimum is reached at m ≈ 2, although the flatness of the curve
indicates that any subsampling size would be acceptable.

FIGURE 5.6
Estimated variance that would be obtained for the GPA example, for different values of $R_a^2$ and
different values of m. The costs used in constructing this graph were C = 300, $c_1$ = 20, and
$c_2$ = 10. The higher the value of $R_a^2$, the smaller the subsample size m should be.
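
As a rough numerical companion to the figures, the large-population form of (5.32) can be evaluated for the three cost scenarios mentioned in the caption of Figure 5.5. The Python sketch below is a hedged illustration: the function names are hypothetical, and the suite size of 4 persons is an assumption (it is not restated in this section) used only to cap the subsample size at M.

```python
import math

def optimal_m(c1, c2, r2_adj, M):
    """Large-population form of (5.32), clipped to the feasible range 1..M."""
    m = math.sqrt(c1 * (1.0 - r2_adj) / (c2 * r2_adj))
    return min(max(m, 1.0), float(M))

def affordable_psus(C, c1, c2, m):
    """Number of psus the budget C allows under the cost function (5.31)."""
    return C / (c1 + c2 * m)

R2_ADJ = 0.337      # sample estimate quoted for the GPA example
SUITE_SIZE = 4      # assumption: persons per suite
BUDGET = 300        # total cost C used for Figure 5.5

cost_scenarios = [(40, 5), (10, 20), (20, 10)]   # (minutes per suite, minutes per person)
plans = {}
for c1, c2 in cost_scenarios:
    m = optimal_m(c1, c2, R2_ADJ, SUITE_SIZE)
    plans[(c1, c2)] = (m, affordable_psus(BUDGET, c1, c2, m))

for (c1, c2), (m, n) in plans.items():
    print(f"c1 = {c1}, c2 = {c2}: m_opt = {m:.2f}, n = {n:.1f}")
```

The three values of `m_opt` come out near 4, 1, and 2, in line with the caption. In practice m and n must be rounded to integers, and plotting the projected variance over a grid of m values, as the text recommends, is more informative than the single optimum.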


For design purposes, we only need a rough estimate of $R_a^2$ or of MSW and MSB.
The adjusted $R^2$ from the ANOVA table from sample data usually provides a good
starting point, even though the sample value of the mean square total often underestimates
$S^2$ when the number of psus in the sample is small (see Exercise 26).




EXAMPLE 5.11 We obtain the following ANOVA table for the coots data in Example 5.7.

                            Sum of        Mean
Source             DF      Squares       Square    F Value
Model             183  257.4175336    1.4066532     237.44
Error             184    1.0900782    0.0059243
Corrected Total   367  258.5076118

R-Square    C.V.        Root MSE    VOLUME Mean
0.995783    3.298616    0.076970    2.333394


If a future survey were planned to estimate average egg volume, one might explore
subsample sizes using $R_a^2$’s around 1 − 0.0059243/(258.5/367) = 0.99. These data indicate
a high degree of homogeneity within clutches for egg volume. For this survey,
however, locating and accessing a clutch is much more time consuming than measuring
the eggs in a clutch. Thus, it might be best to take $m_i = M_i$ despite the high
degree of homogeneity, because the additional information can be used to answer other
research questions concerning variability from clutch to clutch or possible effects of
egg-laying sequence.
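
The planning value of the adjusted $R^2$ comes straight from the ANOVA table: it is one minus the error mean square divided by the total mean square. A short Python sketch of that arithmetic, using only the numbers printed in the table (the variable names are illustrative), is:

```python
# Values copied from the ANOVA table in Example 5.11
ss_model, df_model = 257.4175336, 183
ss_error, df_error = 1.0900782, 184
ss_total, df_total = 258.5076118, 367

msb = ss_model / df_model            # between-clutch mean square
msw = ss_error / df_error            # within-clutch mean square
s2_hat = ss_total / df_total         # mean square total, the sample estimate of S^2

r2 = 1 - ss_error / ss_total         # ordinary R-Square reported in the output
r2_adj = 1 - msw / s2_hat            # adjusted R^2 used for design
f_value = msb / msw

print(f"R2 = {r2:.6f}, adjusted R2 = {r2_adj:.4f}, F = {f_value:.2f}")
```

The ordinary R-Square reproduces the 0.995783 in the output, while the adjusted value is about 0.99, the figure suggested in the text for exploring subsample sizes.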


Although we discussed only designs where all $M_i$’s are equal, we can use these
methods with unequal $M_i$’s as well: just substitute $\bar{M}$ for M in the above work, and
decide the average subsample size $\bar{m}$ to take. Then either take $\bar{m}$ observations in every
cluster, or allocate observations so that

$$\frac{m_i}{M_i} = \text{constant}.$$

As long as the $M_i$’s do not vary too much, this should produce a reasonable design.
If the $M_i$’s are widely variable, and the $t_i$’s are correlated with the $M_i$’s, a cluster
sample with equal probabilities is not necessarily very efficient; an alternative design
is presented in Chapter 6. The file clusterselect.sas tells how to select a two-stage
cluster sample using SAS software.
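
One simple way to allocate subsample sizes so that $m_i/M_i$ is roughly constant is sketched below in Python. This is not the procedure in clusterselect.sas; the function name, the rounding rule, and the cluster sizes in the example are all hypothetical choices made for illustration.

```python
def proportional_subsample_sizes(cluster_sizes, m_bar):
    """Subsample sizes m_i with m_i / M_i roughly constant and average near m_bar.

    Each m_i is rounded to an integer and kept between 1 and M_i.
    """
    M_bar = sum(cluster_sizes) / len(cluster_sizes)
    fraction = m_bar / M_bar                      # common subsampling fraction
    return [min(M, max(1, round(fraction * M))) for M in cluster_sizes]

# Hypothetical sampled psu sizes, with a target average subsample size of 2
sizes = [6, 10, 12, 13, 9]
m = proportional_subsample_sizes(sizes, m_bar=2)
print(m, sum(m) / len(m))
```

Because of rounding, the realized average subsample size need not equal the target exactly, so the resulting cost should be checked against (5.31).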


5.4.3 Choosing the Sample Size (Number of psus)


After the psu size is determined and the subsampling fraction set, we then look at the 
number of psus to sample, n. Like any survey design, design of a cluster sample is an 
iterative process: (1) Determine a desired precision, (2) choose the psu and subsample 
sizes, (3) conjecture the variance that will be achieved with that design, (4) set n to 
achieve the precision, and (5) iterate (adding stratification and auxiliary variables to 
use in ratio estimation) until the cost of the survey is within your budget. 
If clusters are of equal size and we ignore the psu-level fpc, (5.30) implies that

$$V(\hat{\bar{y}}_{\text{unb}}) \le \frac{1}{n} \left[\frac{\text{MSB}}{M} + \left(1 - \frac{m}{M}\right) \frac{\text{MSW}}{m}\right] = \frac{1}{n}\,v.$$

An approximate 100(1 − α)% CI will be

$$\hat{\bar{y}}_{\text{unb}} \pm z_{\alpha/2} \sqrt{\frac{1}{n}\,v}.$$

Thus, to achieve a desired CI half-width e, set $n = z_{\alpha/2}^2\,v/e^2$. Of course, this approach
presupposes that you have some knowledge of v, perhaps from a prior survey. In
Section 7.5, we examine how to determine sample sizes for any situation in which
you know the efficiency of the specified design relative to an SRS design.
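
The sample size rule $n = z_{\alpha/2}^2\,v/e^2$ is easy to script once planning values for MSB and MSW are available. In the Python sketch below the function name is hypothetical, and the inputs are assumptions chosen for illustration: the mean squares are borrowed from the sample ANOVA table of Example 5.11, and a common clutch size M = 10 with m = 2 eggs measured per clutch is supposed.

```python
import math
from statistics import NormalDist

def psus_needed(msb, msw, M, m, half_width, alpha=0.05):
    """Number of psus giving an approximate CI half-width, ignoring the psu-level fpc."""
    v = msb / M + (1.0 - m / M) * msw / m          # v from the bound implied by (5.30)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # normal critical value z_{alpha/2}
    return math.ceil(z ** 2 * v / half_width ** 2)

# Planning values: mean squares from Example 5.11; M = 10 and m = 2 are assumptions
n_required = psus_needed(msb=1.4066532, msw=0.0059243, M=10, m=2, half_width=0.1)
print(n_required)
```

With these inputs almost all of v comes from the between-psu term, so changing m has little effect on the required number of psus.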


5.5 Systematic Sampling


Systematic sampling, discussed briefly in Chapter 2, is really a special case of cluster 
sampling. Suppose we want to take a sample of size 3 from a population that has 12 
elements: 


1 2 3 4 5 6 7 8 9 10 11 12.


To take a systematic sample, choose a number randomly between 1 and 4. Draw that
element and every fourth element thereafter. Thus, the population contains four psus
(they are clusters even though the elements are not contiguous):


{1, 5, 9}  {2, 6, 10}  {3, 7, 11}  {4, 8, 12}.


Now we take an SRS of one psu. 
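
The selection rule just described can be written as a short Python sketch. The function name `systematic_sample` is an illustrative choice; the code lists all four possible systematic samples and then draws one of them with a random start.

```python
import random

def systematic_sample(population, k, start):
    """Every k-th element, beginning at the 1-based position `start` (1 <= start <= k)."""
    return population[start - 1::k]

population = list(range(1, 13))     # the 12 elements 1, 2, ..., 12
k = 4                               # sampling interval

# The four possible systematic samples are the four psus
psus = [systematic_sample(population, k, r) for r in range(1, k + 1)]

# Taking an SRS of one psu is the same as choosing the start at random
chosen = systematic_sample(population, k, random.randint(1, k))
print(psus, chosen)
```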

In a population of NM elements, there are N possible choices for the systematic
sample, each of size M. We observe only the mean of the one psu that comprises our
systematic sample,

$$\bar{y}_i = \bar{y}_{iU} = \hat{\bar{y}}_{\text{sys}}.$$

From the results in Section 5.2.1, $E[\hat{\bar{y}}_{\text{sys}}] = \bar{y}_U$. For a simple systematic sample, we
select n = 1 of the N psus, so by (5.5) and (5.10), the theoretical variance is

$$V(\hat{\bar{y}}_{\text{sys}}) = \left(1 - \frac{1}{N}\right) \frac{S_t^2}{M^2} = \left(1 - \frac{1}{N}\right) \frac{\text{MSB}}{M} \approx \frac{S^2}{M}\,[1 + (M - 1)\,\text{ICC}]. \tag{5.33}$$


In the notation for cluster sampling, M is the size of the systematic sample. Ignoring
the fpc, we see that systematic sampling is more precise than an SRS of size M if the
ICC is negative. Systematic sampling is more precise than simple random sampling
when the variance within the possible systematic samples (psus) is larger than the
overall population variance; then the psu means will be more similar. If there is little
variation within the systematic samples relative to that in the population (that is, ICC
is large), then the elements in the sample all give similar information, and systematic
sampling would be expected to have higher variance than an SRS.

Since n = 1, however, we cannot estimate $V(\hat{\bar{y}}_{\text{sys}})$ using (5.6); we need to know
something about the structure of the population to estimate the variance. Let’s look
at three different population structures.


1 The list is in random order. Systematic sampling is likely to produce a sample that
behaves like an SRS. In many situations, the ordering of the population is unrelated
to the characteristics of interest, as when the list of persons in the sampling frame
is ordered by the last four digits of their telephone numbers. There is no reason to
believe that the persons in a systematic sample will be more or less similar than
a random sample of persons: We expect that ICC ≈ 0. In this situation, simple
random and systematic sampling will give similar results. We can use SRS results
and formulas to estimate $V(\hat{\bar{y}}_{\text{sys}})$.

[Plot omitted; horizontal axis: Position in Sampling Frame]

2 The sampling frame is in increasing or decreasing order. Systematic sampling is
likely to be more precise than simple random sampling. Financial records may
be listed with the largest amounts first and the smallest amounts last. Such a
population is said to have positive autocorrelation: adjacent elements tend to be
more similar than elements that are farther apart. In this case, $V(\hat{\bar{y}}_{\text{sys}})$ is less than the
variance of the sample mean in an SRS of the same size since ICC < 0. A systematic
sample forces the sample values to be spread out; it is possible that an SRS would
consist of all low values or all high values. When the frame is in increasing
or decreasing order, you may use the SRS formula for standard error, but it will
likely be an overestimate and CIs constructed using the SRS standard error will be
too wide.

[Plot omitted; horizontal axis: Position in Sampling Frame]

Stratified sampling may work better than systematic sampling for positively
autocorrelated populations: If the random start is close to either end of the sampling
interval, it will tend to give an estimate that is too low or too high.

3 The sampling frame has a periodic pattern. If we sample at the same interval
as the periodicity, systematic sampling will be less precise than simple random
sampling. Systematic sampling is most dangerous when the population is in a
cyclical or periodic order, and the sampling interval coincides with a multiple of the
period.

[Plot omitted; horizontal axis: Position in Sampling Frame]

Suppose the population values (in order) are

1 2 3 1 2 3 1 2 3 1 2 3

and the sampling interval is 3. Then all elements in the systematic sample will
be the same; if we use the SRS formula to estimate the variance, we will have
$\hat{V}(\hat{\bar{y}}_{\text{sys}}) = 0$. But the true value of $V(\hat{\bar{y}}_{\text{sys}})$ for this population is 2/3 ≈ $S^2$; this
sample is no more precise than a single observation chosen randomly from the
population.
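
For small populations, the true variance of the systematic sample mean can be found by listing every possible random start. The Python sketch below (helper names are illustrative) does this for the periodic population above with interval 3, and, for contrast, for an ordered population 1, 2, ..., 12 with interval 4; it also evaluates the SRS variance formula on a single systematic sample.

```python
from statistics import mean, pvariance, variance

def systematic_means(y, k):
    """Means of the k possible systematic samples with interval k."""
    return [mean(y[r::k]) for r in range(k)]

def true_var_systematic(y, k):
    """Variance of the systematic sample mean over the k equally likely starts."""
    return pvariance(systematic_means(y, k))

def true_var_srs_mean(y, size):
    """Design variance of the mean of an SRS of the given size."""
    return (1 - size / len(y)) * variance(y) / size

def srs_formula_estimate(sample, pop_size):
    """SRS variance estimate applied (naively) to one observed sample."""
    return (1 - len(sample) / pop_size) * variance(sample) / len(sample)

periodic = [1, 2, 3] * 4            # interval 3 coincides with the period
ordered = list(range(1, 13))        # frame in increasing order

v_sys_periodic = true_var_systematic(periodic, 3)
v_hat_periodic = srs_formula_estimate(periodic[0::3], len(periodic))
v_sys_ordered = true_var_systematic(ordered, 4)
v_srs_ordered = true_var_srs_mean(ordered, 3)

print(v_sys_periodic, v_hat_periodic, v_sys_ordered, v_srs_ordered)
```

For the periodic frame the true variance is 2/3 although the SRS formula reports 0; for the ordered frame the systematic sample has variance 1.25, compared with 3.25 for an SRS of the same size.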


Systematic sampling is often used when a researcher wants a representative sample 
of the population, but does not have the resources to construct a sampling frame in 
advance. It is commonly used to select elements at the bottom stage of a cluster sample.
In many situations in which systematic sampling is used, the systematic sample can 
be treated as if it were an SRS. 


Sampling for Hazardous Waste Sites. Many dumps and landfills in the United 
States contain toxic materials. These materials may have been sealed in containers 
when deposited, but may now be suspected of leaking. But we no longer know where 
the materials were deposited—containers of hazardous waste may be randomly dis- 
tributed throughout the landfill, or they may be concentrated in one area, or there may 
be none at all. 

A common practice is to take a systematic sample of grid points and to take soil 
samples from each to look for evidence of contamination. Choose a point at random 
in the area, then construct a grid containing that point so that grid points are an equal 
distance apart. One such grid is shown in Figure 5.7. The advantages of taking a 
systematic sample rather than an SRS are that the systematic sample forces an even 
coverage of the region and is easier to implement in the field. If you are not worried 
about periodic patterns in the distribution of toxic materials, and you have little prior 
knowledge where the toxic materials might be, a systematic sample is a good design. 

With any grid in systematic sampling, you need to worry if the toxic materi- 
als are regularly placed so that the grid may miss all of them, as shown in Fig- 
ure 5.8. If this is a concern, you would be better off taking a stratified sample. Lay 
out the grid, but select a point at random in each square at which to take the soil 
sample.


If periodicity is a concern in a population, one solution is to use interpenetrating
systematic samples (Mahalanobis, 1946). Instead of taking one systematic sample,
take several systematic samples from the population. Then you can use the formulas
for cluster samples to estimate variances; each systematic sample acts as one psu.
This approach is explored in Exercise 21.

FIGURE 5.7
A grid used for detecting hazardous wastes.

FIGURE 5.8
A grid used for detecting hazardous wastes: the worst-case scenario. Since the waste occurs in
a similar pattern to the grid, the systematic sample misses every deposit of toxic waste.
5.6 Model-Based Inference in Cluster Sampling*


The one-way ANOVA model with fixed effects provides a theoretical framework for
stratified sampling; one possible analogous model for cluster sampling is the one-way
ANOVA model with random effects (Scott and Smith, 1969). Let’s look at a simple
version of this model:

$$\text{M1:}\quad Y_{ij} = \mu + A_i + \varepsilon_{ij} \tag{5.34}$$

with $A_i$ generated by a distribution with mean 0 and variance $\sigma_A^2$, $\varepsilon_{ij}$ generated by a
distribution with mean 0 and variance $\sigma^2$, and all $A_i$’s and $\varepsilon_{ij}$’s independent.

Let $T_i = \sum_{j=1}^{M_i} Y_{ij}$. Model M1 implies that the expected total for a cluster increases
linearly with the number of elements in the cluster, because $E_{M1}[Y_{ij}] = \mu$ and

$$E_{M1}[T_i] = E_{M1}\left[\sum_{j=1}^{M_i} Y_{ij}\right] = M_i \mu.$$


This assumption is often appropriate for cluster samples taken in practice. Suppose we 
are taking a two-stage cluster sample to estimate total hospital charges for delivering 
babies; hospitals are selected at the first stage, and birth records are selected at the 
second stage (twins and triplets count as one record). We expect total costs billed by 
a hospital to be larger if the hospital delivers more babies. 

The average cost per birth, however, varies from hospital to hospital: some hospitals
may have higher personnel costs, and others may serve a higher-risk population
or have more expensive equipment. That variation is reflected in the model by the
random effects $A_i$: $A_i$ is the random variable representing the average cost per birth in
the ith hospital minus μ, and $\sigma_A^2$ is the population variance among the hospital means.
In addition, costs vary from birth to birth within the hospitals; that variation is incorporated
into the model by the term $\varepsilon_{ij}$ with variance $\sigma^2$. These ideas are illustrated in
Figure 5.9, presuming that the $A_i$’s and $\varepsilon_{ij}$’s are normally distributed.

Figure 5.9 illustrates that, according to the model in (5.34), costs for births in the
same hospital tend to be more similar than costs for births selected randomly across
the entire population of hospital births, because the cost for a birth in a given hospital
incorporates the hospital characteristics such as personnel costs or nurse/patient ratios.
The model-based intraclass correlation coefficient for Model M1 is defined to be

$$\rho = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}. \tag{5.35}$$

Note that ρ in Model M1 is always nonnegative, in contrast to ICC which can take
on negative values.² Thus, if Model M1 describes the data, cluster sampling must be
less efficient than an SRS of equal size. With Model M1,

$$\text{Cov}_{M1}[Y_{ij}, Y_{kl}] = \begin{cases} \sigma^2 + \sigma_A^2 & \text{if } i = k \text{ and } j = l \\ \sigma_A^2 & \text{if } i = k \text{ and } j \ne l \\ 0 & \text{if } i \ne k. \end{cases}$$

² Model M1, with ρ ≥ 0, would not be appropriate if there is competition within clusters so that one
member of a cluster profits at the expense of another. For example, if other environmental factors can be
discounted, competition within the uterus might cause some fraternal twins to be more variable than
non-twin full siblings.

FIGURE 5.9
Illustration of random effects for hospitals and births.
Now let’s find properties of various estimators under Model M1. To save some work
later, we look at a general linear estimator of the form

$$\hat{T} = \sum_{i \in S} \sum_{j \in S_i} b_{ij} Y_{ij}$$

for $b_{ij}$ any constants. The random variable representing the finite population total is

$$T = \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij}.$$
Inference in a model-based approach is conditional on the units selected to be in
the sample; that is, the inference treats S and $S_i$ as fixed, and treats $Y_{ij}$ as a random
variable. Then, the bias is

$$E_{M1}[\hat{T} - T] = E_{M1}\left[\sum_{i \in S} \sum_{j \in S_i} b_{ij} Y_{ij} - \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij}\right] = \mu \left(\sum_{i \in S} \sum_{j \in S_i} b_{ij} - M_0\right).$$

Thus, $\hat{T}$ is model-unbiased when $\sum_{i \in S} \sum_{j \in S_i} b_{ij} = M_0$. The model-based (for model
M1) variance of $\hat{T} - T$ is

$$V_{M1}[\hat{T} - T] = \sigma_A^2 \left[\sum_{i \in S} \left(\sum_{j \in S_i} b_{ij} - M_i\right)^2 + \sum_{i \notin S} M_i^2\right] + \sigma^2 \left[\sum_{i \in S} \sum_{j \in S_i} (b_{ij}^2 - 2 b_{ij}) + M_0\right]. \tag{5.36}$$

(See Exercise 30.)
Now let’s look at what happens with design-based estimators under Model M1.
The random variable for the design-unbiased estimator is

$$\hat{T}_{\text{unb}} = \sum_{i \in S} \sum_{j \in S_i} \frac{N M_i}{n m_i}\,Y_{ij};$$

the coefficients $b_{ij}$ are the sampling weights $(N M_i)/(n m_i)$. But

$$\sum_{i \in S} \sum_{j \in S_i} b_{ij} = \sum_{i \in S} \sum_{j \in S_i} \frac{N M_i}{n m_i} = \frac{N}{n} \sum_{i \in S} M_i,$$

so the bias under model M1 in (5.34) is

$$\mu \left(\frac{N}{n} \sum_{i \in S} M_i - M_0\right).$$

Note that the bias depends on which sample is taken, and the estimator is model-unbiased
under (5.34) only when the average of the $M_i$’s in the sample equals the
average of the $M_i$’s in the population, such as will occur when all $M_i$’s are the same.
For the ratio estimator, the coefficients are $b_{ij} = M_0 (M_i/m_i)/\sum_{k \in S} M_k$, and

$$\hat{T}_r = \sum_{i \in S} \sum_{j \in S_i} \frac{M_0 M_i}{m_i \sum_{k \in S} M_k}\,Y_{ij}.$$

For these $b_{ij}$’s,

$$\sum_{i \in S} \sum_{j \in S_i} b_{ij} = \sum_{i \in S} \sum_{j \in S_i} \frac{M_0 M_i}{m_i \sum_{k \in S} M_k} = M_0,$$
so the ratio estimator is model-unbiased under Model M1. If Model M1 describes
the population, then the ratio estimator adjusts for the sizes of the particular psus
chosen for the sample; it uses $M_i$, a quantity that is correlated with the ith psu total,
to compensate for the possibility that the sample may have a different proportion of
large psus than does the population.

The variance expression in (5.36) is complicated; if $M_i = M$ and $m_i = m$ for all i,
then $\hat{T}_{\text{unb}} = \hat{T}_r$, $b_{ij} = (NM)/(nm)$, and the variance in (5.36) simplifies to

$$V_{M1}[\hat{T}_{\text{unb}} - T] = M_0 M (N - n) \frac{\sigma_A^2}{n} + M_0 (MN - mn) \frac{\sigma^2}{mn}. \tag{5.37}$$

EXAMPLE 5.13 Let’s return to the puppy homes discussed in Example 5.8. They certainly follow
Model M1: All puppies have four legs, so $Y_{ij} = \mu = 4$ for all i and j. Consequently,
$\sigma_A^2 = \sigma^2 = 0$. The model-based variance of the estimate $\hat{T}_{\text{unb}}$ is therefore 0, no matter
which puppy home and puppies are chosen. If Puppy Palace is selected for the sample,
the bias under the model in (5.34) is 4(2 × 30 − 40) = 80; if Dog’s Life is selected, the
bias is 4(2 × 10 − 40) = −80. The large variance in the design-based approach thus
becomes a bias when a model-based approach is adopted. It is not surprising that
$\hat{T}_{\text{unb}}$ performs poorly for the puppy homes; it is a poor estimator for a model that
describes the situation well. The model-based bias and variance for $\hat{T}_r$, though, are
both zero.
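
The bias and variance formulas for a general linear estimator under Model M1 are mechanical enough to code directly. The Python sketch below is an illustration under stated assumptions: the function names and the dictionary layout for the coefficients $b_{ij}$ are hypothetical, psus are indexed from 0, and the equal-size population used to compare (5.36) with (5.37) is made up.

```python
def m1_bias_and_variance(b, M_all, sampled, mu, sigma_A2, sigma2):
    """Model M1 bias and variance (5.36) of T-hat minus T for coefficients b.

    b: dict mapping a sampled psu index to the list of its coefficients b_ij.
    M_all: sizes M_i of all N psus; sampled: indices of the psus in the sample.
    """
    M0 = sum(M_all)
    bias = mu * (sum(sum(b[i]) for i in sampled) - M0)
    between = (sum((sum(b[i]) - M_all[i]) ** 2 for i in sampled)
               + sum(M_all[i] ** 2 for i in range(len(M_all)) if i not in sampled))
    within = sum(x * x - 2 * x for i in sampled for x in b[i]) + M0
    return bias, sigma_A2 * between + sigma2 * within

def unbiased_weights(M_all, sampled, m):
    """b_ij = N M_i / (n m_i) for each of the m_i sampled ssus in psu i."""
    N, n = len(M_all), len(sampled)
    return {i: [N * M_all[i] / (n * m[i])] * m[i] for i in sampled}

def ratio_weights(M_all, sampled, m):
    """b_ij = M_0 M_i / (m_i * sum of sampled M_k)."""
    M0, MS = sum(M_all), sum(M_all[i] for i in sampled)
    return {i: [M0 * M_all[i] / (m[i] * MS)] * m[i] for i in sampled}

# Puppy homes: psu 0 is Puppy Palace (30 puppies), psu 1 is Dog's Life (10 puppies)
homes = [30, 10]
bias_pp, var_pp = m1_bias_and_variance(unbiased_weights(homes, [0], {0: 2}),
                                       homes, [0], mu=4, sigma_A2=0, sigma2=0)
bias_dl, _ = m1_bias_and_variance(unbiased_weights(homes, [1], {1: 2}),
                                  homes, [1], mu=4, sigma_A2=0, sigma2=0)
bias_ratio, _ = m1_bias_and_variance(ratio_weights(homes, [0], {0: 2}),
                                     homes, [0], mu=4, sigma_A2=0, sigma2=0)

# Made-up equal-size case: N = 5 psus of size M = 4, n = 2 psus, m = 3 ssus each
N, M, n, m_eq, sA2, s2 = 5, 4, 2, 3, 2.0, 1.5
_, v_general = m1_bias_and_variance(unbiased_weights([M] * N, [0, 1], {0: m_eq, 1: m_eq}),
                                    [M] * N, [0, 1], mu=0.0, sigma_A2=sA2, sigma2=s2)
M0 = N * M
v_closed = M0 * M * (N - n) * sA2 / n + M0 * (M * N - m_eq * n) * s2 / (m_eq * n)  # (5.37)

print(bias_pp, bias_dl, bias_ratio, v_general, v_closed)
```

The puppy-home biases come out as 80 and −80 for the unbiased estimator and 0 for the ratio estimator, and the general expression agrees with the closed form in the equal-size case.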


The above results are only for Model M1. Suppose that a better model for the
population is

$$\text{M2:}\quad Y_{ij} = B_i + \varepsilon_{ij}, \tag{5.38}$$

with $E[B_i] = \mu/M_i$, $V[M_i B_i] = \sigma_B^2$, $E[\varepsilon_{ij}] = 0$, $V[\varepsilon_{ij}] = \sigma^2$, and all $B_i$ and $\varepsilon_{ij}$
independent. Under Model M2, then, the cluster totals all have expected value μ,
regardless of cluster size. Examples that are described by this model are harder to
come by in practice, but let’s construct one based on the principle that tasks expand
to fill up the allotted time. All students at Idyllic College have 100 hours available
for writing term papers, but an individual student may have from one to five papers
assigned. It would never occur to an Idyllic student to finish a paper quickly and
relax in the extra time, so a student with one paper spends all 100 hours on the paper,
a student with two papers spends 50 on each, and so on. Thus, the expected total
amount of time spent writing term papers, $E[T_i]$, is 100 for each student, although the
numbers of papers assigned ($M_i$) vary.

The estimator $\hat{T}_{\text{unb}}$ is unbiased under Model M2:

$$E_{M2}[\hat{T}_{\text{unb}} - T] = E_{M2}\left[\sum_{i \in S} \sum_{j \in S_i} \frac{N M_i}{n m_i}\,Y_{ij} - \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij}\right] = \sum_{i \in S} \sum_{j \in S_i} \frac{N M_i}{n m_i}\,\frac{\mu}{M_i} - \sum_{i=1}^{N} \sum_{j=1}^{M_i} \frac{\mu}{M_i} = 0.$$

Thus, $\hat{T}_{\text{unb}}$ performs poorly if model (5.34) is appropriate, but often quite well if
model (5.38) is appropriate. Of course, these are not the only two possible models:
Royall (1976) derives results for a general class of possible models that includes
both (5.34) and (5.38), and allows unequal variances for different clusters. Chapter 8
of Valliant et al. (2000) describes other correlation structures for clustered
populations.

FIGURE 5.10
Plot of $\hat{t}_i$ vs. $M_i$, for the coots data.
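
The behavior of the two estimators under Model M2 can be checked on a toy version of the Idyllic College example. In the Python sketch below the population of five students (with one to five papers), the particular sample, and the function name `m2_bias` are all hypothetical; the calculation uses only $E[Y_{ij}] = \mu/M_i$, so the subsample sizes cancel and only the sum of the coefficients within each sampled psu is needed.

```python
def m2_bias(sum_b, M_all, mu):
    """Model M2 bias of a linear estimator.

    sum_b: dict mapping a sampled psu index to the sum of its coefficients b_ij.
    Under M2, E[Y_ij] = mu / M_i, so every psu total has expected value mu.
    """
    expected_estimate = sum(sb * mu / M_all[i] for i, sb in sum_b.items())
    expected_total = mu * len(M_all)
    return expected_estimate - expected_total

papers = [1, 2, 3, 4, 5]        # hypothetical numbers of papers M_i for five students
mu = 100.0                      # hours available to each student
sampled = [3, 4]                # suppose the students with 4 and 5 papers are sampled
N, n, M0 = len(papers), len(sampled), sum(papers)
MS = sum(papers[i] for i in sampled)

unb_sums = {i: N * papers[i] / n for i in sampled}       # sum over j of N M_i / (n m_i)
ratio_sums = {i: M0 * papers[i] / MS for i in sampled}   # sum over j of ratio weights

bias_unb = m2_bias(unb_sums, papers, mu)
bias_ratio = m2_bias(ratio_sums, papers, mu)
print(bias_unb, bias_ratio)
```

For this sample the unbiased estimator has model bias 0, while the ratio estimator underestimates the total number of hours by about 167, the reverse of what happened under Model M1.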

If you decide to use a model-based approach to analyze cluster sample data, you 
need to be very careful that the model chosen is appropriate. We saw in the puppy 
example that the Model M1 variance for Tis is 0 but the bias is large; we could only 
evaluate the bias because we knew the results for the whole population. A person who 
sampled only Puppy Palace and did not know the results for Dog’s Life would not 
be able to evaluate the bias, and might conclude that puppies average six legs each! 
Thus, assessing the adequacy of the model is crucial in any model-based analysis. You 
must check the assumption that V[e;;] = o” by plotting the variances of each cluster, 
just as you assess the equal variance assumption in ANOVA. A plot of 7; versus M; is 
often useful in assessing the appropriateness of a model for the data in the sample. 
As always in model-based inference, we must assume that the model also holds for 
population elements not in the sample. 
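
The checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cluster_diagnostics` and the small data set are hypothetical, every sampled psu is assumed to contain at least two sampled ssus, and $t_i$ is taken to be the estimated psu total $M_i \bar y_i$. It returns the within-cluster sample variances (to compare against the equal-variance assumption) and the correlation between the estimated psu totals and the psu sizes.

```python
from statistics import mean, variance

def cluster_diagnostics(samples, M):
    """Per-cluster summaries for checking a cluster-sample model.

    samples: list of lists, the sampled y values in each psu (at least 2 per psu)
    M:       list of psu sizes M_i, in the same order
    """
    s2 = [variance(y) for y in samples]              # within-cluster sample variances
    t = [Mi * mean(y) for Mi, y in zip(M, samples)]  # estimated psu totals M_i * ybar_i
    # Pearson correlation between the estimated totals and the psu sizes
    t_bar, M_bar = mean(t), mean(M)
    sxy = sum((ti - t_bar) * (Mi - M_bar) for ti, Mi in zip(t, M))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    syy = sum((Mi - M_bar) ** 2 for Mi in M)
    corr = sxy / (sxx * syy) ** 0.5
    return t, s2, corr

# Hypothetical data: three psus with two sampled ssus each
t, s2, corr = cluster_diagnostics([[1, 3], [2, 4], [5, 7]], M=[2, 4, 6])
print(t, s2, round(corr, 3))
```

Roughly equal values in `s2` are consistent with a common within-cluster variance; a strong positive correlation suggests that psu totals grow with psu size, as Model M1 would predict.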


Let’s fit Model M1, a one-way random effects model, to the coots data. Looking at 
Figure 5.4, it seems plausible (except for one clutch) that the within-clutch variance 
is the same for each clutch. Figure 5.10 shows the plot of $t_i$ vs. $M_i$ for the coots
data.

For these data, $\mathrm{Corr}(t_i, M_i) = 0.97$. If Model M1 is appropriate for the data, we
expect that $t_i$ will increase with $M_i$; if Model M2 were appropriate, we would expect
a horizontal line to fit the plotted points. For these data, $t_i$ and $M_i$ are clearly related,
although the relationship does not appear to be a straight line. 

Using SAS PROC MIXED, the estimated variance components are $\hat\sigma_A^2 = 0.70036$
and $\hat\sigma^2 = 0.00592$. Using $b_i = M_i/(m_i \sum_{k \in S} M_k)$, the estimated mean egg volume is
2.492; adapting (5.36) to ignore the fpc (see Exercise 30), the estimated model-based
variance is

$$
\sum_{i \in S}\left(\frac{M_i}{\sum_{k \in S} M_k}\right)^2 \hat\sigma_A^2 + \sum_{i \in S}\frac{1}{m_i}\left(\frac{M_i}{\sum_{k \in S} M_k}\right)^2 \hat\sigma^2 = 0.003944 + 0.000017 = 0.00396.
$$


If a different model were adopted, the estimated variance would be different.
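
The arithmetic in this example can be sketched as follows. The code assumes that, with the fpc ignored, the Model M1 variance of an estimator $\sum_{i \in S}\sum_{j \in S_i} b_i Y_{ij}$ is $\sigma_A^2 \sum_{i \in S} (m_i b_i)^2 + \sigma^2 \sum_{i \in S} m_i b_i^2$, with $b_i = M_i/(m_i \sum_{k \in S} M_k)$. The function name and the two-clutch input are hypothetical; the coots data themselves are not reproduced here, so the output does not match the values in the text.

```python
def model_variance_of_mean(M, m, sigma2_A, sigma2):
    """Between- and within-cluster parts of the Model M1 variance (fpc ignored).

    M: psu sizes M_i for the sampled psus
    m: subsample sizes m_i for the sampled psus
    sigma2_A, sigma2: estimated variance components
    """
    total_M = sum(M)
    # (m_i * b_i)^2 = (M_i / sum M_k)^2
    between = sum((Mi / total_M) ** 2 for Mi in M) * sigma2_A
    # m_i * b_i^2 = (M_i / sum M_k)^2 / m_i
    within = sum((Mi / total_M) ** 2 / mi for Mi, mi in zip(M, m)) * sigma2
    return between, within

# Hypothetical input: two clutches of 10 eggs, 2 eggs measured in each,
# with the variance components estimated in the text
between, within = model_variance_of_mean([10, 10], [2, 2], 0.70036, 0.00592)
print(between, within, between + within)
```

With many sampled clusters, the between-cluster part dominates whenever $\hat\sigma_A^2$ is much larger than $\hat\sigma^2$, which is the pattern seen in the example.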


Most statisticians who use model-based analyses with cluster samples adopt a 
model such as M1 in (5.34) to estimate the population mean $\mu$ or regression parameters.
Binder and Roberts (2003) describe the use of models for this situation. We shall
return to models for cluster samples in Section 11.5. 


6.2 Design Using Models 




Models are extremely useful for designing a cluster sample. Using a model for design 
does not mean you have to use a model for analysis of your survey data when it 
is collected; rather, the model provides a useful way of summarizing information 
you can use to make the survey more efficient. Much research has been done on 
using models for design: see Rao (1979b), Bellhouse (1984), and Royall (1992b) for 
literature reviews. 

Suppose that Model M1 seems reasonable for your population, and that all psu 
sizes in the population are equal. Then you would like to design the survey to minimize 
the variance in (5.37), subject to cost constraints. Then, using the cost function in
(5.31), the model-based variance is minimized when

$$
m = \sqrt{\frac{c_1 \sigma^2}{c_2 \sigma_A^2}}.
$$


Suppose that the $M_i$’s are unequal and that Model M1 holds. We can use the
variance in (5.36) to determine the optimal subsampling size $m_i$ for each cluster. This
approach was used by Royall (1976) for more general models than considered in this
section. For $\hat T_r$, $b_i = M_i/(m_i \sum_{k \in S} M_k)$, and the variance is minimized when $m_i$ is
proportional to $M_i$ (see Exercise 32).
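
A minimal sketch of both design calculations follows. It rests on assumptions not stated in this passage: for equal psu sizes, the part of the Model M1 variance that depends on the design is taken to be proportional to $\sigma_A^2/n + \sigma^2/(nm)$, the cost function is $C = c_1 n + c_2 nm$, and minimizing under that cost gives $m = \sqrt{c_1\sigma^2/(c_2\sigma_A^2)}$. The function names and the numbers are hypothetical.

```python
import math

def optimal_design(c1, c2, sigma2_A, sigma2, budget):
    """Equal psu sizes: subsample size per psu and number of psus for a budget."""
    m_opt = math.sqrt(c1 * sigma2 / (c2 * sigma2_A))  # unrounded optimum
    m = max(1, round(m_opt))                           # at least one ssu per psu
    n = int(budget // (c1 + c2 * m))                   # psus affordable at cost c1 + c2*m each
    return m_opt, m, n

def proportional_allocation(M, L):
    """Unequal psu sizes: split L measurements so that m_i is proportional to M_i."""
    total = sum(M)
    return [L * Mi / total for Mi in M]

m_opt, m, n = optimal_design(c1=40, c2=10, sigma2_A=1.0, sigma2=4.0, budget=1000)
alloc = proportional_allocation([10, 20, 30], L=12)
print(m_opt, m, n, alloc)
```

In practice the rounded $m$ must also not exceed the psu size $M$, and the allocations from `proportional_allocation` need rounding to integers; both steps are left out of this sketch.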


Chapter Summary 


Cluster sampling is commonly used in large surveys, but estimates obtained from 
cluster samples usually have greater variance than if we were able to measure the 
same number of observation units using an SRS. If it is much less expensive to 
sample clusters than individual elements, though, cluster sampling can provide more 
precision per dollar spent. 

All of the formulas in this chapter for cluster sampling with equal probabilities 
are special cases of the general results for two-stage cluster sampling with unequal 
psu sizes, to be derived in Chapter 6. They can be applied to any two-stage cluster 
sample in which the psus are selected with equal probability. These formulas were
given in (5.18), (5.24), (5.26), and (5.28) and are repeated here: 


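$$
\hat t_{unb} = \frac{N}{n}\sum_{i \in S} \hat t_i = \frac{N}{n}\sum_{i \in S} M_i \bar y_i, \tag{5.39}
$$

$$
\hat V(\hat t_{unb}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{n}\sum_{i \in S}\left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i}, \tag{5.40}
$$

$$
\hat{\bar y}_r = \frac{\sum_{i \in S} M_i \bar y_i}{\sum_{i \in S} M_i}, \tag{5.41}
$$

$$
\hat V(\hat{\bar y}_r) = \frac{1}{\bar M^2}\left[\left(1 - \frac{n}{N}\right)\frac{s_r^2}{n} + \frac{1}{nN}\sum_{i \in S} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}\right], \tag{5.42}
$$

with

$$
s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(\hat t_i - \frac{\hat t_{unb}}{N}\right)^2
$$

and

$$
s_r^2 = \frac{1}{n-1}\sum_{i \in S}\left(M_i \bar y_i - M_i \hat{\bar y}_r\right)^2.
$$
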
In one-stage cluster sampling, the second term in (5.40) and (5.42) is zero since
$m_i = M_i$. The variance estimators depend mostly on the variability between psus.
Point estimates of the population mean and total are usually calculated using
weights. If an SRS of n of the N population psus is chosen, and an SRS of $m_i$ of the
$M_i$ ssus in psu i is taken, then the sampling weight for observation j of psu i is


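$$
w_{ij} = \frac{N M_i}{n m_i}.
$$

Then

$$
\hat t_{unb} = \sum_{i \in S}\sum_{j \in S_i} w_{ij} y_{ij}
$$

and

$$
\hat{\bar y}_r = \frac{\sum_{i \in S}\sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S}\sum_{j \in S_i} w_{ij}}.
$$
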

While weights can be used to find estimated means and totals, they do not provide 
sufficient information to estimate the variance in a cluster sample. You need to use 
the formulas in (5.40) and (5.42), or a method such as jackknife from Chapter 9, to 
calculate standard errors. 
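
The summary formulas can be sketched in Python as follows. This is an illustration, not the book's SAS code: the function name and the data are hypothetical, the variance of the ratio estimator is assumed to have the two-term form $\frac{1}{\bar M^2}\left[(1 - n/N)\,s_r^2/n + \frac{1}{nN}\sum_{i \in S} M_i^2 (1 - m_i/M_i)\,s_i^2/m_i\right]$, and the sample average psu size is used in place of $\bar M$ (an assumption for the case where the population average is unknown).

```python
from statistics import mean, variance

def two_stage_estimates(N, M, samples):
    """Estimates for a two-stage cluster sample with equal-probability psus.

    N:       number of psus in the population
    M:       sizes M_i of the n sampled psus
    samples: list of lists with the sampled y values in each psu
    """
    n = len(samples)
    m = [len(y) for y in samples]
    ybar = [mean(y) for y in samples]
    # s_i^2 is not estimable from a single observation; use 0 in that case
    s2 = [variance(y) if len(y) > 1 else 0.0 for y in samples]
    t_hat = [Mi * yb for Mi, yb in zip(M, ybar)]

    t_unb = N / n * sum(t_hat)                                    # (5.39)
    s_t2 = sum((t - t_unb / N) ** 2 for t in t_hat) / (n - 1)
    within = sum((1 - mi / Mi) * Mi ** 2 * s / mi
                 for mi, Mi, s in zip(m, M, s2))
    v_t = N ** 2 * (1 - n / N) * s_t2 / n + N / n * within       # (5.40)

    y_r = sum(t_hat) / sum(M)                                     # (5.41)
    s_r2 = sum((t - Mi * y_r) ** 2 for t, Mi in zip(t_hat, M)) / (n - 1)
    M_bar = sum(M) / n                                            # sample average psu size (assumption)
    v_r = ((1 - n / N) * s_r2 / n + within / (n * N)) / M_bar ** 2  # assumed form of (5.42)

    # The same point estimates from the sampling weights w_ij = N M_i / (n m_i)
    w = [N * Mi / (n * mi) for Mi, mi in zip(M, m)]
    t_w = sum(wi * sum(y) for wi, y in zip(w, samples))
    y_w = t_w / sum(wi * mi for wi, mi in zip(w, m))
    return {"t_unb": t_unb, "v_t": v_t, "y_r": y_r, "v_r": v_r,
            "t_weighted": t_w, "y_weighted": y_w}

# Hypothetical sample: n = 2 psus out of N = 10
res = two_stage_estimates(10, [4, 6], [[1, 3], [2, 4, 6]])
print(res)
```

The weighted point estimates agree with (5.39) and (5.41), but the weights alone do not give the variances; those need the psu-level quantities $s_t^2$, $s_r^2$, and $s_i^2$.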
Key Terms


Cluster: See Primary sampling unit. 


Cluster sampling: A probability sampling design in which observations are grouped 
into clusters (psus). A probability sample of psus is selected from the population of
psus. 


Intraclass correlation coefficient (ICC): The Pearson correlation coefficient of all 
pairs of units within the same cluster. 


One-stage cluster sampling: A cluster sampling design in which all ssus in selected 
psus are observed. 


Primary sampling unit (psu): The unit that is sampled from the population. 


Secondary sampling unit (ssu): A subunit that is subsampled from the selected 
psus. 


Two-stage cluster sampling: A cluster sampling design in which the ssus in selected 
psus are subsampled. 


For Further Reading 


Stuart (1984) gives a great deal of intuition into cluster sampling with clear illus- 
trations and examples. Cochran’s (1977) classic book thoroughly covers the theory 
of unbiased estimation in cluster samples; Cochran (1939) used ANOVA tables in 
sample surveys. Skinner et al. (1989b) and Binder and Roberts (2003) delineate the 
issues involved in different approaches to inference in cluster samples. Royall (1976) 
applies best linear unbiased estimation to finite population sampling problems with 
naturally occurring clusters. The book by Valliant et al. (2000) describes model-based 
methods for cluster samples. 

The classic paper by Mahalanobis (1946) gives insight into many issues in survey 
sampling. Among other concepts, Mahalanobis developed the technique of interpen- 
etrating subsampling, in which the sample is drawn as two smaller, independent sub- 
samples. We mentioned this technique briefly for estimating the variance of systematic 
samples. Ultimately, Mahalanobis’ idea led to the replication methods (discussed in 
Sections 9.2 and 9.3) now commonly used for variance estimation in complex surveys. 


A. Introductory Exercises 


1 A city council of a small city wants to know the proportion of eligible voters that
oppose having an incinerator of Phoenix garbage opened just outside of the city limits.
They randomly select 100 residential numbers from the city’s telephone book that
contains 3,000 such numbers. Each selected residence is then called and asked for
(a) the total number of eligible voters and (b) the number of voters opposed to the
incinerator. A total of 157 voters were surveyed; of these, 23 refused to answer the
question. Of the remaining 134 voters, 112 opposed the incinerator, so the council
estimates the proportion by

$$
\hat p = 112/134 = .83582
$$

with

$$
\hat V(\hat p) = .83582(1 - .83582)/134 = 0.00102.
$$

Are these estimates valid? Why, or why not?


2 Senturia et al. (1994) describe a survey taken to study how many children have
access to guns in their households. Questionnaires were distributed to all parents 
who attended selected clinics in the Chicago area during a one-week period for well 
or sick child visits. 


a Suppose that the quantity of interest is percentage of the households with guns. 
Describe why this is a cluster sample. What is the psu? The ssu? Is it a one-stage or 
two-stage cluster sample? How would you estimate the percentage of households 
with guns, and the standard error of your estimate? 


b What is the sampling population for this study? Do you think this sampling pro- 
cedure results in a representative sample of households with children? Why, or 
why not? 


3 Kleppel et al. (2004) report on a study of wetlands in upstate New York. Four wetlands
were selected for the study: Two of the wetlands drain watersheds from small towns 
and the other two drain suburban watersheds. Quantities such as pH were measured 
at two to four randomly selected sites within each of the four wetlands. 


a Describe why this is a cluster sample. What are the psus? The ssus? How would 
you estimate the average pH in the suburban wetlands? 


b The authors used Student’s two-sample t test to compare the average pH from the
sites in the suburban wetlands with the average pH from the sites in the small town 
wetlands, treating all sites as independent. Is this analysis appropriate? Why, or 
why not? 


4 Survey evidence is often introduced in court cases involving trademark violation and
employment discrimination. There has been controversy, however, about whether 
nonprobability samples are acceptable as evidence in litigation. Jacoby and Handlin 
(1991) selected 26 from a list of 1285 scholarly journals in the social and behavioral 
sciences. They examined all articles published during 1988 for the selected journals, 
and recorded (1) the number of articles in the journal that described empirical research 
from a survey (they excluded articles in which the authors analyzed survey data which 
had been collected by someone else) and (2) the total number of articles for each 
journal which used probability sampling, nonprobability sampling, or for which the 
sampling method could not be determined. The data are in file journal.dat. 


a Explain why this is a cluster sample.


b Estimate the proportion of articles in the 1285 journals that use nonprobability 
sampling, and give the standard error of your estimate. 


c The authors conclude that, because “an overwhelming proportion of ...
recognized scholarly and practitioner experts rely on non-probability sampling
designs,” courts “should have no problem admitting otherwise well-conducted
non-probability surveys and according them due weight” (p. 175). Comment on
this statement.


5 A language school owner takes an SRS of 10 of the 72 Introductory Spanish classes 
offered by the school. Each student in each of the sampled classes is given a vocabulary 
test and is also asked whether he or she is planning a trip to a Spanish-speaking country 
in the next year. The data are in file spanish.dat. 


a Estimate the total number of students planning a trip to a Spanish-speaking country
in the next year, and give a 95% CI. 


b Estimate the mean vocabulary test score for Introductory Spanish students in the 
language school, and give a 95% CI. 


6 An inspector samples cans from a truckload of canned creamed corn to estimate the 
total number of worm fragments in the truckload. The truck has 580 cases; each case 
contains 24 cans. The inspector samples 12 cases at random, and subsamples 3 cans 
randomly from each selected case. 


Case 
1 2 3 4 5 6 7 8 9 10 11 12 


Can 1 1 4 0 3 4 0 5 3 7 3 4 0 
Can 2 5 2 1 6 7 
Can 3 7 4 2 6 8 3 1 2 5 4 9 0 


\o 
So 
ww 
— 
~ 
oO 


Using (5.20) and (5.24), estimate the total number of worm fragments, along with a 
95% CI. Compare the estimated value of the variance from (5.24) with the approxi- 
mation that is used by SAS software, given in (5.25). 


7 The new candy Green Globules is being test-marketed in an area of upstate New York. 
The market research firm decided to sample 6 cities from the 45 cities in the area and 
then to sample supermarkets within cities, wanting to know the number of cases of 
Green Globules sold. 


City   Number of Supermarkets   Number of Cases Sold
1      52                       146, 180, 251, 152, 72, 181, 171, 361, 73, 186
2      19                       99, 101, 52, 121
3      37                       199, 179, 98, 63, 126, 87, 62
4      39                       226, 129, 57, 46, 86, 43, 85, 165
5      8                        12, 23
6      14                       87, 43, 59


Obtain summary statistics for each cluster. Plot the data, and estimate the total number 
of cases sold, and the average number sold per supermarket, along with the standard 
errors of your estimates. 


8 A homeowner with a large library needs to estimate the purchase cost and replacement 
value of the book collection for insurance purposes. She has 44 shelves containing 
books, and selects 12 shelves at random. To prepare for the second stage of sampling, 
she counts the number of books $M_i$ on the selected shelves. She generates five random
numbers between 1 and $M_i$ for each selected shelf, to determine which specific
books, numbered from left to right, to examine more closely. She then looks up the 
replacement value for the sampled books in Books in Print. The data are given in the 
file books.dat. 


a Draw side-by-side boxplots for the replacement costs of books on each shelf. 
Does it appear that the means are about the same? The variances? 


b Estimate the total replacement cost for the library, and find the standard error of 
your estimate. What is the estimated coefficient of variation? 


c Estimate the average replacement cost per book, along with the standard error. 
What is the estimated coefficient of variation? 


9 Repeat Exercise 8 for the purchase cost for each book. Plot the data, and estimate the
total and average amount she has spent for books, along with the standard errors. 


10 Construct a sample ANOVA table for the replacement cost data in Exercise 8. What
is your estimate for $R_a^2$? Do books on the same shelf tend to have more similar
replacement costs? Suppose that $c_1 = 10$ and $c_2 = 4$. If all shelves had 30 books,
how many books should be sampled per shelf?


B. Working with Survey Data 


11 An accounting firm is interested in estimating the error rate in a compliance audit it is
conducting. The population contains 828 claims, and the firm audits an SRS of 85 of
those claims. In each of the 85 sampled claims, 215 fields are checked for errors. One
claim has errors in 4 of the 215 fields, 1 claim has 3 errors, 4 claims have 2 errors,
22 claims have 1 error, and the remaining 57 claims have no errors. (Data courtesy of
Fritz Scheuren.)


a Treating the claims as psus and the observations for each field as ssus, estimate 
the error rate, defined to be the average number of errors per field, along with the 
standard error for your estimate. 


b Estimate (with standard error) the total number of errors in the 828 claims. 


c Suppose that instead of taking a cluster sample, the firm had taken an SRS of
85 × 215 = 18,275 fields from the 178,020 fields in the population. If the estimated
error rate from the SRS had been the same as in (a), what would the estimated
variance $\hat V(\hat p_{SRS})$ be? How does this compare with the estimated variance
from (a)?


12 Use the data in coots.dat to estimate the average egg length, along with its standard
error. Be sure to plot the data appropriately. 


13 The Arizona Health Care Cost Containment System (AHCCCS) provides medical
assistance to low-income households in Arizona. Each county determines whether 
households are eligible for assistance. Sometimes, however, households are certi- 
fied to be eligible when they are really not eligible. The Arizona Statutes, section 
36-2905.01, mandate the collection of a “statistically valid quality control sample of 
the eligibility certifications made by each county.” The certification error rate for each 
county is to be determined “by dividing the number of members in the sample who
were erroneously certified by the total number of members in the sample.” Quality 
control audits are done by sampling household records, however; once a household 
record is selected and audited, it costs the same amount to evaluate one person in the 
household as to evaluate all persons in the household. 


a Explain how to use cluster sampling to estimate the certification error rate for a
county. 


b Suppose that a county certified 1572 households to be eligible for medical assis- 
tance in 1995. In past years, the certification error rate per household has been 
about 10%. How many households should be included in your sample so that the 
half-width of a 95% CI for estimating the per-person certification error rate is less 
than 0.03? What assumptions did you need to make to arrive at your sample size? 
Calculate the sample size for different values of M and homogeneity. 


14 A researcher took an SRS of 4 high schools from a region with 29 high schools for a
study on the prevalence of smoking among female high school students in the region. 
The results were as follows: 


School   Number of Students   Number of Female   Number of Female       Number of
         in School            Students           Students Interviewed   Smokers
1        1471                 792                25                     10
2        890                  447                15                     3
3        1021                 511                20                     6
4        1587                 800                40                     27


a Estimate the percentage of female high school students in the region who smoke, 
along with a 95% CI. 


b Estimate the total number of female high school students in the region who smoke, 
along with a 95% CI. 


c The researcher now wants to study the prevalence of smoking and other risk
behaviors among female high school students in a different region with 35 high
schools. She intends to drive to n of the schools and then interview some or all
of the female students in the selected schools. Assuming that MSB and MSW
are similar in the two regions, use information from the study of 4 schools to
estimate $R_a^2$ and design a cluster sample for the new study. Suppose it takes about
50 hours per school to contact school officials, obtain permission, obtain a list of 
female students, and travel back and forth. Although interviews themselves are 
only about 10 minutes, it takes about 30 minutes per interview obtained to allow 
for additional scheduling of no-shows, obtaining parental permission, and other 
administrative tasks. The investigator would like to spend 300 hours or less on 
the data collection. 


15 Gnap (1995) conducted a survey to estimate the teacher workload in Maricopa County,
Arizona, public school districts. Her target population was all first through sixth grade 
full-time public school teachers with at least one year of experience. In 1994, Maricopa 
County had 46 school districts with 311 elementary schools and 15,086 teachers. Gnap 
stratified the schools by size of school district; the large stratum, consisting of schools 
in districts with more than 5000 students, is considered in this exercise. The stratum
contained 245 schools; 23 participated in the survey. All teachers in the selected 
schools were asked to fill out the questionnaire. Due to nonresponse, however, some 
questionnaires were not returned. (We shall examine possible effects of nonresponse 
in Exercise 12 of Chapter 8.) The data are in file teachers.dat, with psu information 
in teachmi.dat. 


a Why would a cluster sample be a better design than an SRS for this study? Consider
issues such as cost, ease of collecting data, and confidentiality of respondent. What 
are some disadvantages of using a cluster sample? 


b Calculate the mean and standard deviation of hrwork for each school in the “large” 
stratum. Construct a graph of the means for each school and a separate graph of 
the standard deviations. Does there seem to be more variation within a school, or 
does more of the variability occur between different schools? How did you deal 
with missing values (coded as -9)?


c Construct a scatterplot of the standard deviations versus the means for the schools,
for the variable hrwork. Is there more variability in schools with higher workloads? 
Less? No apparent relation? 


d Estimate the average of hrwork in the large stratum in Maricopa County, along
with its standard error. Use popteach in teachmi.dat for the $M_i$’s.


16 The file measles.dat contains data consistent with that obtained in a survey of par-
ents whose children had not been immunized for measles during a recent campaign to 
immunize all children between the ages of 11 and 15. During the campaign, 7633 chil- 
dren from the 46 schools in the area were immunized; 9962 children whose records 
showed no previous immunization were not immunized. In a follow-up survey to 
explore why the children had not been immunized during the campaign, Roberts 
et al. (1995) sent questionnaires to the parents of a cluster sample of the 9962 children.
Ten schools were randomly selected, then a sample of the $M_i$ nonimmunized
children from each school was selected and the parents of those children were sent a 
questionnaire. Not all parents responded to the questionnaire; you will examine the 
effects of nonresponse in Exercise 13 of Chapter 8. 


a Estimate, separately for each school, the percentage of parents who returned a
consent form (variable returnf). For this exercise, treat the “no answer” responses
(value 9) as not returned.


b Using the number of respondents in school i as m;, construct the sampling weight 
for each observation. 


c Estimate the overall percentage of parents who received a consent form along
with a 95% CI. 


d How do your estimate and interval in part (c) compare with the results you would 
have obtained if you had ignored the clustering and analyzed the data as an SRS? 
Find the ratio: 


$$
\frac{\text{estimated variance from (c)}}{\text{estimated variance if the data were analyzed as an SRS}}.
$$


What is the effect of clustering? 
17 Repeat Exercise 16, for estimating the percentage of children who had previously had
measles. 


18 Refer to Example 5.9. Later in the potato growing season, it takes more time to inspect
stems. Suppose that it takes two minutes to inspect each stem. Which psu size is most 
efficient? 


19 a For the SRS from the Census of Agriculture data in the file agsrs.dat (discussed in
Example 2.5), find the sample ANOVA table of acres92, using state as the cluster
variable. Estimate $R_a^2$ from the sample. Is there a clustering effect?


b Suppose that $c_1 = 15c_2$, where $c_1$ is the cost to sample a state, and $c_2$ is the
cost to sample a county within a state. What should m be, if it is desired to
sample a total of 300 counties? How many states would be sampled (that is, what 
is n)? 


20 Using the value of n determined in Exercise 19, draw a self-weighting cluster sample
of 300 counties from agpop.dat, using state as the cluster variable. Plot the data using 
side-by-side boxplots. Estimate the total number of acres devoted to farms in the 
United States, along with the standard error, using both the unbiased estimate and the 
ratio estimate. How do these values compare with each other and with values from 
the SRS and stratified sample from Examples 2.5 and 3.2? 


21 The file ozone.dat contains hourly ozone readings from Eskdalemuir, Scotland, for
1994 and 1995. 


a Construct a histogram of the population values. Find the mean, standard deviation, 
and median of the population. 

b Take a systematic sample with period 24. To do this, select a random integer k 
between 1 and 24, and select the column containing the observations with GMT
k. Construct a histogram of the sample values. 

c Now suppose you treated your systematic sample as though it was an SRS. Find 
the sample mean, standard deviation, and median. Construct an interval estimate 
of the population mean using the procedure in Section 2.5. Does your interval 
contain the true value of the population mean from (a)? 

d Take four independent systematic samples, each with period 96. Now use formulas 
from cluster sampling to estimate the population mean, and construct a 95% CI 
for the mean. 


C. Working with Theory 


22 The ICC is defined as the Pearson correlation coefficient for the $NM(M - 1)$ pairs
$(y_{ij}, y_{ik})$ for i between 1 and N and $j \ne k$:

$$
ICC = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k \ne j} (y_{ij} - \bar y_U)(y_{ik} - \bar y_U)}{(NM - 1)(M - 1)S^2}. \tag{5.43}
$$

Show that the above definition is equivalent to (5.8). HINT: First show that

$$
\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k \ne j} (y_{ij} - \bar y_U)(y_{ik} - \bar y_U) + \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar y_U)^2 = M(\text{SSB}).
$$

23 For the quantities in the population ANOVA table (Table 5.1), show that

$$
\text{MSW} = \frac{NM - 1}{NM} S^2 (1 - ICC)
$$

and

$$
\text{MSB} = \frac{NM - 1}{M(N - 1)} S^2 [1 + (M - 1)ICC].
$$


24 Suppose in a two-stage cluster sample that all population cluster sizes are equal
($M_i = M$ for all i), and that all sample sizes for the clusters are equal ($m_i = m$ for
all i).


a Show (5.30).

b Show that $\text{MSW} = S^2(1 - R_a^2)$ and that

$$
\text{MSB} = S^2\left[\frac{N(M - 1)R_a^2}{N - 1} + 1\right].
$$

c Using (a) and (b), express $V(\hat{\bar y})$ as a function of n, m, N, M, and $R_a^2$.


d Show that if $S^2$ and the sample and population sizes are fixed, and if $(m - 1)/m > n/N$,
then $V(\hat{\bar y})$ is an increasing function of $R_a^2$.


25 Suppose in a two-stage cluster sample that all population cluster sizes are equal
($M_i = M$ for all i), and that all sample sizes for the clusters are equal ($m_i = m$ for
all i).

a Show that $\hat t_{unb} = \hat t_r$ and, hence, that $\hat{\bar y}_{unb} = \hat{\bar y}_r$.

b Fill in the formulas for the sums of squares in the ANOVA table below, for the
sample data.


Source         df          Sum of Squares   Mean Square
Between psus   n − 1                        msb
Within psus    n(m − 1)                     msw
Total          nm − 1                       msto


c Show that $E[\text{msw}] = \text{MSW}$ and

$$
E[\text{msb}] = \frac{m}{M}\text{MSB} + \left(1 - \frac{m}{M}\right)\text{MSW},
$$

where MSB and MSW are the between and within mean squares, respectively,
from the population ANOVA table given in Table 5.1.


d Show that

$$
\widehat{\text{MSB}} = \frac{M}{m}\text{msb} - \left(\frac{M}{m} - 1\right)\text{msw}
$$

is an unbiased estimator of MSB.

e Show, using (5.24) or (5.28), that

$$
\hat V(\hat{\bar y}) = \left(1 - \frac{n}{N}\right)\frac{\text{msb}}{nm} + \frac{1}{N}\left(1 - \frac{m}{M}\right)\frac{\text{msw}}{m}.
$$


26 For the situation in Exercise 25, let msto represent the mean square total from the
sample ANOVA table.


a Write msto as a function of msb and msw, and use the results of Exercise 25(c)
to find E[msto].


b Show that $E[\text{msto}] \approx S^2$ if n and N are large.

c Show that

$$
\hat S^2 = \frac{M(N - 1)}{m(NM - 1)}\text{msb} + \frac{(m - 1)NM + M - m}{m(NM - 1)}\text{msw}
$$

is an unbiased estimator of $S^2$.


27 (Requires calculus.) Show that if $M_i = M$ and $m_i = m$ for all i, and if the cost function
is $C = c_1 n + c_2 nm$, then

$$
m_{opt} = \sqrt{\frac{c_1 M(\text{MSW})}{c_2(\text{MSB} - \text{MSW})}} = \sqrt{\frac{c_1 M(N - 1)(1 - R_a^2)}{c_2(NM - 1)R_a^2}}
$$

minimizes the variance for fixed total cost C. HINT: Show the result with MSW and
MSB first, then use Exercise 24(b).


28 (Requires trigonometry.) In Example 5.12, a systematic sampling scheme was pro-
posed for detecting hazardous wastes in landfills. How far apart should sampling 
points be placed? Suppose that if there is leakage, it will spread to a circular region 
with radius R. Let D be the distance between adjacent sampling points in the same 
row or column. 


a Calculate the probability with which a contaminant will be detected. HINT:
Consider three cases, with $R < D$, $D \le R \le \sqrt{2}D$, and $R > \sqrt{2}D$.


b Propose a sampling design that gives a higher probability that a contaminant will
be detected than the square grid, but does not increase the number of sampling 
points. 


29 (Requires knowledge of random effects models.) Under Model M1 in (5.34), a one-way
random effects model, the intraclass correlation coefficient $\rho$ may be estimated by

$$
\hat\rho = \frac{\hat\sigma_A^2}{\hat\sigma_A^2 + \hat\sigma^2},
$$

where $\hat\sigma_A^2$ and $\hat\sigma^2$ estimate the variance components $\sigma_A^2$ and $\sigma^2$. The method of
moments estimators for one-stage cluster sampling when all clusters are of the same
size are $\hat\sigma^2 = \text{msw}$ and $\hat\sigma_A^2 = (\text{msb} - \text{msw})/M$, where msw and msb are the within
and between mean squares from the sample ANOVA table.


a What is $\hat\rho$ in Example 5.4? How does it compare with $\widehat{ICC}$?


b Calculate $\hat\rho$ for Populations A and B in Example 5.3. Why do these differ from
the ICC?
30 (Requires knowledge of random effects models.)


a Suppose we ignore the fpc of a model-based estimator. Find

$$
V_{M1}\left(\sum_{i \in S}\sum_{j \in S_i} b_{ij} Y_{ij}\right).
$$

b Prove (5.36). HINT: let

$$
c_{ij} = \begin{cases} b_{ij} - 1 & \text{if } i \in S \text{ and } j \in S_i \\ -1 & \text{otherwise.} \end{cases}
$$

Then $\hat T - T = \sum_{i=1}^{N}\sum_{j=1}^{M_i} c_{ij} Y_{ij}$.


31 (Requires linear algebra and calculus.) Although $\hat T_r$ is unbiased for Model M1, it is
possible to construct an estimator with smaller variance: let
Mk 
ce = ———— 
© T+ pm = 1) 


and 


x Ci Mo - py, s ChMy 
Topt = 2 > “TM, = eae |. 


Show that $\hat T_{opt}$ is unbiased and minimizes the variance in (5.36) among all unbiased
estimators for Model (5.34). 


32 (Requires calculus.) Suppose that the $M_i$’s are unequal and Model M1 holds. The
budget allows you to take a total of L measurements on subunits. Show that the
variance in (5.36) is minimized for $\hat T_r$ when $m_i$ is proportional to $M_i$. HINT: Use
Lagrange multipliers, with the constraint $\sum_{i \in S} m_i = L$.


D. Projects and Activities 


33 (Requires the R statistical software package.) Alf and Lohr (2007) present an R
program, intervals, that explores the effects of ignoring clustering on CIs. The program 
and its usage are described on the website. 


a After copying the program into R, type intervals(0) to generate 100 samples,
and their associated confidence intervals, from a population with ICC = 0. What
is the effect of ignoring clustering when ICC = 0?


b Now type intervals(0.5) to see what happens with a population with ICC =
1/2. What percentage of the interval estimates, calculated ignoring the clustering, 
include the true mean? What about the intervals calculated using the formulae for 
cluster samples? Compare the widths of the two interval estimates. 


c What do you think will happen if you try the intervals program with ICC = 1?
What will happen to the widths of the correctly calculated confidence intervals?
Do you expect the percentage of interval estimates that include the true mean to
increase or decrease for the two methods? Type intervals(1) to test your
predictions.
d Explore how the estimated coverage probability depends on the ICC and M. Run
the program with all combinations of $M \in \{2, 10, 25\}$ and $ICC \in \{0, 0.2, 0.7, 1\}$.
Plot the coverage probability versus ICC, drawing a curve for each value of M. 


34 The January 1994 issue of The Nation ranked 22 columnists by how much they
used the words “I,” “me,” and “myself.” Select your favorite newspaper columnist or 
blogger. Randomly select 5 of the columnist’s or blogger’s columns that appeared in 
the past year, and use one-stage cluster sampling to estimate the proportion of total 
words taken up by “I,” “me,” and “myself.” What is your psu? Your ssu? 


35 Online bookstore. You may have noticed in Exercise 33 of Chapter 2 that it took quite
a bit of time to locate the records chosen for the SRS. It may be faster to take an SRS 
of pages from the website, then look at some or all of the books listed on that page. 
Use the following procedure to take a cluster sample of books from the genre you 
studied in Chapter 2, recording the amount of time you spend selecting the sample and 
collecting the data. Take an SRS of 10 pages, then sample 5 books per page. For each 
sampled book, record the price, number of pages, and whether the book is paperback 
or hardback. Estimate the mean of each variable, and give a 95% CI. Do you think 
clustering decreased precision relative to an SRS? Compare the precision per unit 
time for the SRS and the cluster sample by calculating 1/[(estimated variance) x 
time] for each method. 


36 Baseball data.

a Use the population in the file baseball.dat to take a one-stage cluster sample
with the teams as the psus. Your sample should have approximately 150 players 
altogether, as in the SRS from Exercise 32 of Chapter 2. Describe how you selected 
your sample. The SAS code in the file clusterselect.sas may be helpful. 

b Draw side-by-side boxplots of logsal for the teams in your sample.

c Use your sample to estimate the mean of the variable logsal, and give a 95% CI. 

d Estimate the proportion of players in the data set who are pitchers, and give a 
95% CI. 

e Compare your estimates with those from Exercise 32 of Chapter 2. Which esti- 
mates have smaller CIs? Why do you think that happened? 


Baseball data: Two-stage sample. 

a Use your SRS from Exercise 32 of Chapter 2 to estimate the population value of 
$R_a^2$. If we treat teams as the psus, and if all teams had the same size, what would 
the optimal subsampling size be (assume that $c_1 = c_2$)? 

b Use the population in the file baseball.dat to take a two-stage cluster sample with 
the teams as the psus, using the subsampling fraction from part (a). Your sample 
should have approximately 150 players altogether, as in the SRS from Exercise 32 
of Chapter 2. Describe how you selected your sample. 

c Draw side-by-side boxplots of logsal for the teams in your sample. 

d Use your sample to estimate the mean of the variable logsal, and give a 95% CI. 


e Estimate the proportion of players in the data set who are pitchers, and give a 
95% CI. 

f Compare your estimates with those from Exercise 32 of Chapter 2. Which estimates 
have smaller CIs? Why do you think that happened? 


g Compare your estimates with those from Exercise 36. 


38 IPUMS exercises. 
a Generate a frequency table of the number of persons within each psu. 


b Suppose that it costs $50 per interview to collect data using an SRS. If a cluster 
sample is taken, it costs $100 per psu chosen, plus $20 for each interview taken. 
Select an SRS of 10 psus. In each of the selected psus, take a subsample of persons 
with sample size proportional to the population size within that psu. Your total 
cost for the sample should be about the same as for the SRS you took in Chapter 2. 


c Using the sample you selected, estimate the population mean of inctot and give 
the standard error of your estimate. Also estimate the population total of inctot 
and give its standard error. How do these estimates compare with those from the 
SRS you took in Chapter 2? 


EXAMPLE 6.1 


Sampling with Unequal 
Probabilities 


‘Personally I never care for fiction or storybooks. What I like to read about are facts and statistics of any 
kind. If they are only facts about the raising of radishes, they interest me. Just now, for instance, before 
you came in’—he pointed to an encyclopaedia on the shelves—‘I was reading an article about 
“Mathematics.” Perfectly pure mathematics. 

‘My own knowledge of mathematics stops at “twelve times twelve,” but I enjoyed that article 
immensely. I didn’t understand a word of it; but facts, or what a man believes to be facts, are always 
delightful. That mathematical fellow believed in his facts. So do I. Get your facts first, and’—the voice 
dies away to an almost inaudible drone—‘then you can distort ’em as much as you please.’ 


—Mark Twain, quoted in Rudyard Kipling, From Sea to Sea 


Up to now, we have only discussed sampling schemes in which the probabilities of 
choosing sampling units are equal. Equal probabilities give schemes that are often 
easy to design and explain. Such schemes are not, however, always possible or, if 
practicable, as efficient as schemes using unequal probabilities. We saw in Exam- 
ple 5.8 that a cluster sample with equal probabilities may result in a large variance 
for the design-unbiased estimator of the population mean and total. 


O’Brien et al. (1995) took a sample of nursing home residents in the Philadelphia area, 
with the objective of determining residents’ preferences on life-sustaining treatments. 
Do they wish to have cardiopulmonary resuscitation (CPR) if the heart stops beating, 
or to be transferred to a hospital if a serious illness develops, or to be fed through 
an enteral tube if no longer able to eat? The target population was all residents of 
licensed nursing homes in the Philadelphia area. There were 294 such homes, with a 
total of 37,652 beds (before sampling, they only knew number of beds, not number 
of residents). 

Because the survey was to be done in person, cluster sampling was essential for 
keeping survey costs manageable. Had the researchers chosen to use cluster sampling 
with equal probabilities of selection, they would have taken a simple random sample 
(SRS) of nursing homes, then another SRS of residents within each selected home. 

In a cluster sample with equal probabilities, however, a nursing home with 20 
beds is as likely to be chosen for the sample as a nursing home with 1000 beds. The 
sample is only self-weighting if the subsample size for each home is proportional to 
the number of beds in the home. Each bed sampled represents the same number of 
beds in the population if one-stage cluster sampling is used, or if 10% (or any other 
fixed percentage) of beds are sampled in each selected home. 

Sampling homes with equal probabilities would result in a mathematically valid 
estimator, but it has three major shortcomings. First, you would expect that the total 
number of patients in a home who desire CPR ($t_i$) would be proportional to the 
number of beds in the home ($M_i$), so estimators from Chapter 5 may have large 
variance. Second, a self-weighting equal-probability sample may be cumbersome 
to administer. It may require driving out to a nursing home just to interview one 
or two residents, and equalizing workloads of interviewers may be difficult. Third, 
the cost of the sample is unknown in advance—a random sample of 40 homes may 
consist primarily of large nursing homes, which would lead to greater expense than 
anticipated. 

Instead of taking a cluster sample of homes with equal probabilities, the investi- 
gators randomly drew a sample of 57 nursing homes with probabilities proportional 
to the number of beds. They then took an SRS of 30 beds (and their occupants) from a 
list of all beds within the nursing home. If the number of residents equals the number 
of beds, and if a home has the same number of beds when visited as are listed in the 
sampling frame, then the sampling design results in every resident having the same 
probability of being included in the sample. The cost is known before selecting the 
sample, the same number of interviews are taken at each home, and the estimator of 
a population total will likely have a smaller variance than estimators in Chapter 5. 
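
The self-weighting claim can be checked with a short calculation. As a sketch, assume that the number of residents in home $i$ equals its number of beds $M_i$ (with $M_i \ge 30$), and that the probability home $i$ is included in the sample is exactly proportional to its size, $57\,M_i/37{,}652$; both are simplifying assumptions rather than details reported for the survey. Then, for any resident of home $i$,

$$P(\text{resident sampled}) = P(\text{home } i \text{ sampled}) \times P(\text{resident sampled} \mid \text{home } i \text{ sampled}) \approx \frac{57\,M_i}{37{,}652} \times \frac{30}{M_i} = \frac{1710}{37{,}652} \approx 0.045.$$

The home size $M_i$ cancels, so under these assumptions the probability is the same for every resident regardless of the size of the home.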

Since this sample is self-weighting, you can easily obtain point estimates (but not 
standard errors) of desired quantities by usual methods. You can estimate the median 
age of the nursing home residents by finding the sample median of the residents in 
the sample, or the 70th percentile by finding the 70th percentile of the sample. If a 
sample is not self-weighting, point estimates are still easily calculated using weights. 
A warning, though: Always consider the cluster design when calculating the precision 
of your estimates. ■ 


In Chapter 3 we noted that sometimes stratified sampling is used to sample units 
with different probabilities. In a survey to estimate total business expenditures on 
advertising, we might want to stratify by company sales or income. The largest com- 
panies such as IBM would be in one stratum, medium-sized companies would be in 
a number of different strata, and very small companies such as Robin’s Tailor Shop 
would be in yet other strata. An optimal allocation scheme would sample a very high 
fraction (perhaps 100%) in the stratum with the largest companies, and a small fraction 
of companies in the strata with the smallest companies; the variance from company 
to company will be much higher among IBM, General Motors, and Microsoft than 
among Robin’s Tailor Shop, Pat’s Shoe Repair, and Flowers by Leslie. The variance is 
larger in the large companies just because the amounts of money involved are so much 
larger. Thus, the sampling variance is decreased by assigning unequal probabilities 
to sampling units in different strata. 

To estimate the total spent on advertising using this stratified sample, we assign 
higher weights to companies with lower inclusion probabilities. As discussed in 
Section 3.3, the probability that a company in stratum $h$ will be included in the sample is 
$n_h/N_h$; the sampling weight for that company is $N_h/n_h$. Each company sampled in 
stratum $h$ represents $N_h/n_h$ companies in the population, and 
$\hat{t}_{\text{str}} = \sum_{h=1}^{H} \sum_{j \in S_h} (N_h/n_h)\, y_{hj}$. 

We can also use unequal inclusion probabilities to decrease variances without 
explicitly stratifying. When sampling with unequal probabilities, we deliberately vary 
the probabilities that we will select different primary sampling units (psus) for the 
sample, and compensate by providing suitable weights in the estimation. The key is 
that we know the probabilities¹ with which we will select a given unit: 


$$P(\text{unit } i \text{ selected on first draw}) = \psi_i \tag{6.1}$$

$$P(\text{unit } i \text{ in sample}) = \pi_i. \tag{6.2}$$


The deliberate selection of psus with known but unequal probabilities differs 
greatly from the selection bias discussed in Chapter 1. Many surveys with selec- 
tion bias do sample with unequal probabilities, but the probabilities of selection are 
unknown and unestimable, so the survey takers cannot compensate for the unequal 
probabilities in the weighting. If you take a survey of students by asking students 
who walk by the library to participate, you certainly are sampling with unequal 
probabilities—students who use the library frequently are more likely to be asked 
to participate in the survey, while other students never go by the library at all. But you 
have no idea how many students in the population are represented by a participant 
in your survey, and no way of correcting for the unequal probabilities of selection in 
the estimation. In addition, some students in your target population never walk by the 
library, so they cannot be included in your sample. 

When first presented with the idea of unequal-probability sampling, some peo- 
ple think of it as “unnatural” or “contrived.” On the contrary, for many populations 
with clustering, unequal-probability sampling at the psu level produces a sample that 
mirrors the population better than an equal-probability sample. Examples of unequal- 
probability samples are given in Section 6.5. To understand these examples and to 
design your own samples, it is essential that you have an understanding of probability. 
We consider with-replacement sampling first, starting with the simple design of select- 
ing one psu. Many large sample surveys are analyzed as though the sampling was 
done with replacement, even if a without-replacement sample was collected, because 
the estimators of the variance for with-replacement samples have simple form and 
require less information. In Section 6.4, we consider unequal-probability sampling 
without replacement. Notation used in this chapter is defined in Section 5.1. 


6.1 Sampling One Primary Sampling Unit 


As a special case, suppose we select just one ($n = 1$) of the $N$ psus to be in the sample. 
The total for psu $i$ is denoted by $t_i$, and we want to estimate the population total, $t$. 
Sampling one psu will demonstrate the ideas of unequal-probability sampling without 
introducing the complications. 


¹We consider two different probabilities in this chapter, because when sampling with unequal 
probabilities without replacement, as considered in Section 6.4, selecting a unit on the first draw can 
affect the selection probabilities for other units. 

Let’s start out by looking at what happens for a situation in which we know the 
whole population. A town has four supermarkets, ranging in size from 100 square 
meters (m²) to 1000 m². We want to estimate the total amount of sales in the four 
stores for last month by sampling just one of the stores. (Of course, this is just 
an illustration; if we really had only four supermarkets we would probably take 
a census.) You might expect that a larger store would have more sales than a smaller 
store, and that the variability in total sales among several 1000-m² stores will be 
greater than the variability in total sales among several 100-m² stores. 

Since we sample only one store, the probability that a store is selected on the first 
draw ($\psi_i$) is the same as the probability that the store is included in the sample ($\pi_i$). 
For this example, take 

$$\pi_i = \psi_i = P(\text{store } i \text{ selected})$$

proportional to the size of the store. Since store A accounts for 1/16 of the total floor 
area of the four stores, it is sampled with probability 1/16. For illustrative purposes, 
we know the values of $t_i$ for the whole population: 


Store    Size (m²)    $\psi_i$    $t_i$ (in Thousands) 
A          100        1/16         11 
B          200        2/16         20 
C          300        3/16         24 
D         1000       10/16        245 
Total     1600        1           300 


We could select a probability sample of size 1 with the probabilities given above 
by shuffling cards numbered 1 through 16 and choosing one card. If the card’s number 
is 1, choose store A; if 2 or 3, choose B; if 4, 5, or 6, choose C; and if 7 through 16, 
choose D. Or we could spin once on a spinner whose sectors are sized in proportion 
to the selection probabilities. 

We compensate for the unequal probabilities of selection by also using $\psi_i$ in 
the estimator. We have already seen such compensation for unequal probabilities 
in stratified sampling: If we select 10% of the units in stratum 1 and 20% of the units 
in stratum 2, the sampling weight is 10 for each unit in stratum 1 and 5 for each unit in 
stratum 2. Here, we select store A with probability 1/16, so store A’s sampling weight 
is 16. If the size of the store is roughly proportional to the total sales for that store, we 
would expect that store A also has about 1/16 of the total sales and that multiplying 
store A’s sales by 16 would estimate the total sales for all four stores. As always, the 
sampling weight of unit $i$ is the reciprocal of the probability of selection: 


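$$w_i = \frac{1}{P(\text{unit } i \text{ in sample})} = \frac{1}{\pi_i}.$$

Thus, our estimator of the population total from an unequal-probability sample of 
size 1 (for which $\pi_i = \psi_i$) is 

$$\hat{t}_\psi = \sum_{i \in S} w_i t_i = \sum_{i \in S} \frac{t_i}{\psi_i}.$$

Four samples of size 1 are possible from this simple population: 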


Sample    $\psi_i$    $t_i$    $\hat{t}_\psi$    $(\hat{t}_\psi - t)^2$ 
{A}        1/16        11      176       15,376 
{B}        2/16        20      160       19,600 
{C}        3/16        24      128       29,584 
{D}       10/16       245      392        8,464 


Using the definition of expected value in (2.3), 

$$E[\hat{t}_\psi] = \sum_{\text{possible samples } S} P(S)\,\hat{t}_{\psi S} = \frac{1}{16}(176) + \frac{2}{16}(160) + \frac{3}{16}(128) + \frac{10}{16}(392) = 300.$$


Of course, $\hat{t}_\psi$ will always be unbiased because in general, 

$$E[\hat{t}_\psi] = \sum_{i=1}^{N} \psi_i \frac{t_i}{\psi_i} = t. \tag{6.3}$$

The variance of $\hat{t}_\psi$ is 

$$V[\hat{t}_\psi] = E[(\hat{t}_\psi - t)^2] = \sum_{\text{possible samples } S} P(S)\,(\hat{t}_{\psi S} - t)^2 = \sum_{i=1}^{N} \psi_i \left(\frac{t_i}{\psi_i} - t\right)^2. \tag{6.4}$$


For this example, 

$$V[\hat{t}_\psi] = \frac{1}{16}(15{,}376) + \frac{2}{16}(19{,}600) + \frac{3}{16}(29{,}584) + \frac{10}{16}(8{,}464) = 14{,}248.$$


Compare these results to those from an SRS of size 1, in which the probability of 
selecting each unit is $\psi_i = 1/4$, so $1/\psi_i = 4 = N$. Note that if all of the probabilities 
of selection are equal, as in simple random sampling, $1/\psi_i$ always equals $N$. For the 
SRS design: 


Sample    $\psi_i$    $t_i$    $\hat{t}_\psi$    $(\hat{t}_\psi - t)^2$ 
{A}        1/4         11       44       65,536 
{B}        1/4         20       80       48,400 
{C}        1/4         24       96       41,616 
{D}        1/4        245      980      462,400 


As always, $\hat{t}_{\text{SRS}}$ is unbiased and thus has expectation 300, but for this example the 
SRS variance is much larger than the variance from the unequal-probability design: 

$$V[\hat{t}_{\text{SRS}}] = \frac{1}{4}(65{,}536) + \frac{1}{4}(48{,}400) + \frac{1}{4}(41{,}616) + \frac{1}{4}(462{,}400) = 154{,}488.$$


The variance from the unequal-probability design, 14,248, is much smaller because 
the design uses auxiliary information: We expect the store size to be related to the 
sales, and use that information in the sample design. 
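
The two variances can be reproduced by enumerating the four possible samples of size 1 under each design. The sketch below is only an illustration of (6.3) and (6.4) on the supermarket population; the function name `design_moments` and the dictionary layout are choices made here, not part of the text, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

# Store totals t_i (in thousands) and floor areas (m^2) from the supermarket table.
sales = {"A": 11, "B": 20, "C": 24, "D": 245}
area = {"A": 100, "B": 200, "C": 300, "D": 1000}


def design_moments(psi, totals):
    """Return (expected value, variance) of t_hat = t_i / psi_i for a sample of size 1."""
    t = sum(totals.values())
    # (6.3): each store is selected with probability psi_i and then gives t_i / psi_i.
    expected = sum(psi[s] * (Fraction(totals[s]) / psi[s]) for s in totals)
    # (6.4): probability-weighted squared deviations from the true total.
    variance = sum(psi[s] * (Fraction(totals[s]) / psi[s] - t) ** 2 for s in totals)
    return expected, variance


total_area = sum(area.values())
psi_size = {s: Fraction(a, total_area) for s, a in area.items()}  # proportional to size
psi_equal = {s: Fraction(1, len(sales)) for s in sales}           # SRS of size 1

mean_size, var_size = design_moments(psi_size, sales)
mean_equal, var_equal = design_moments(psi_equal, sales)
print(mean_size, var_size)    # 300 14248
print(mean_equal, var_equal)  # 300 154488
```

Both designs are unbiased for the total of 300; only the spread of the four possible estimates differs.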

We believe that $t_i$ is correlated with the size of the store, which is known. Since store 
D accounts for 10/16 of the total floor area of supermarkets, it is reasonable to believe 
that store D will account for about 10/16 of the total sales as well. Thus, if store D is 
chosen and is believed to account for about 10/16 of the total sales, we would have a 
good estimate of total sales by multiplying store D’s sales by 16/10. 

What if store D accounts for only 4/16 of the total sales? Then the unequal- 
probability estimator $\hat{t}_\psi$ will still be unbiased over repeated sampling, but it will have 
a large variance (see Exercise 3). The method still works mathematically, but is not 
as efficient as if $t_i$ is roughly proportional to $\psi_i$. 

Sampling only one psu is not as unusual as you might think. Many large complex 
surveys are so highly stratified that each stratum contains only a few psus. A large 
number of strata is used to increase the precision of the survey estimates. In such 
a survey, it may be perfectly reasonable to want to select only one psu from each 
stratum. But, with only one psu per stratum in the sample, we do not have an estimate 
of the variability between psus within a stratum. When large survey organizations 
sample only one psu per stratum, they often divide the psus into pseudo-psus for 
variance estimation. 


6.2 One-Stage Sampling with Replacement 


Now suppose $n > 1$, and we sample with replacement. Sampling with replacement 
means that the selection probabilities do not change after we have drawn the first unit. 
Let 

$$\psi_i = P(\text{select unit } i \text{ on first draw}).$$

If we sample with replacement, then $\psi_i$ is also the probability that unit $i$ is selected 
on the second draw, or the third draw, or any other given draw. 

The idea behind unequal-probability sampling is simple. Draw $n$ psus with replacement. 
Then estimate the population total, using the estimator from the previous section, 
separately for each psu drawn. Some psus may be drawn more than once; the 
estimated population total, calculated using a given psu, is included as many times as 
the psu is drawn. Since the psus are drawn with replacement, we have $n$ independent 
estimates of the population total. Estimate the population total $t$ by averaging those 
$n$ independent estimates of $t$. The estimated variance is the sample variance of the $n$ 
independent estimates of $t$, divided by $n$. 


6.2.1 Selecting Primary Sampling Units 


EXAMPLE 6.2 


6.2.1.1 The Cumulative-Size Method 


There are several ways to sample psus with unequal probabilities. All require that you 
have a measure of size for all psus in the population. The cumulative-size method 
extends the method used in the previous section, in which random numbers are gen- 
erated and psus corresponding to those numbers are included in the sample. For the 
supermarkets, we drew cards from a deck with cards numbered 1 through 16. If the 
card’s number is 1, choose store A; if 2 or 3, choose B; if 4, 5, or 6, choose C; and if 7 
through 16, choose D. To sample with replacement, put the card back after selecting 
a psu and draw again. 


Consider the population of introductory statistics classes at a college shown in 
Table 6.1. The college has 15 such classes; class $i$ has $M_i$ students, for a total of 
647 students in introductory statistics courses. We decide to sample 5 classes with 
replacement, with probability proportional to $M_i$, and then collect a questionnaire 
from each student in the sampled classes. For this example, then, $\psi_i = M_i/647$. 

TABLE 6.1 
Population of Introductory Statistics Classes 

Class Number    $M_i$    $\psi_i$     Cumulative $M_i$ Range 
 1               44     0.068006       1–44 
 2               33     0.051005      45–77 
 3               26     0.040185      78–103 
 4               22     0.034003     104–125 
 5               76     0.117465     126–201 
 6               63     0.097372     202–264 
 7               20     0.030912     265–284 
 8               44     0.068006     285–328 
 9               54     0.083462     329–382 
10               34     0.052550     383–416 
11               46     0.071097     417–462 
12               24     0.037094     463–486 
13               46     0.071097     487–532 
14              100     0.154560     533–632 
15               15     0.023184     633–647 
Total           647     1 

To select the sample, generate five random integers with replacement between 
1 and 647. Then the psus to be chosen for the sample are those whose range in the 
cumulative $M_i$ includes the randomly generated numbers. The set of five random 
numbers {487, 369, 221, 326, 282} results in the sample of units {13, 9, 6, 8, 7}. 
The cumulative-size method allows the same unit to appear more than once: the five 
random numbers {553, 082, 245, 594, 150} lead to the sample {14, 3, 6, 14, 5}; 
psu 14 is then included twice in the data. SAS code for selecting a sample from this 
population is given on the website. ■ 
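
The cumulative-size lookup can also be sketched in a few lines of Python. This is an illustration only, not the SAS code from the book's website; the function names `psus_for_numbers` and `cumulative_size_sample` are hypothetical, and the class sizes are those of Table 6.1.

```python
import bisect
import itertools
import random

# Class sizes M_i for classes 1..15, from Table 6.1.
CLASS_SIZES = [44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15]


def psus_for_numbers(sizes, numbers):
    """Map integers in 1..sum(sizes) to 1-based psu labels via the cumulative ranges."""
    upper = list(itertools.accumulate(sizes))  # upper end of each cumulative range
    # bisect_left finds the first range whose upper end is >= the number drawn.
    return [bisect.bisect_left(upper, r) + 1 for r in numbers]


def cumulative_size_sample(sizes, n, rng):
    """Draw n psus with replacement, with probability proportional to size."""
    total = sum(sizes)
    numbers = [rng.randint(1, total) for _ in range(n)]  # randint includes both ends
    return psus_for_numbers(sizes, numbers)


# The two sets of random numbers used in the example.
print(psus_for_numbers(CLASS_SIZES, [487, 369, 221, 326, 282]))  # [13, 9, 6, 8, 7]
print(psus_for_numbers(CLASS_SIZES, [553, 82, 245, 594, 150]))   # [14, 3, 6, 14, 5]

# A fresh with-replacement sample of five classes (result depends on the seed).
print(cumulative_size_sample(CLASS_SIZES, 5, random.Random(2024)))
```

Because each integer between 1 and 647 is equally likely and class $i$ owns $M_i$ of them, class $i$ is selected on any draw with probability $M_i/647$.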


Of course, we can take an unequal-probability sample when the $\psi_i$’s are not 
proportional to the $M_i$’s: Simply form a cumulative $\psi_i$ range instead, and sample 
uniform random numbers between 0 and 1. This variation of the method is discussed 
in Exercise 2. 
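
A minimal sketch of this variation, using the supermarket probabilities 1/16, 2/16, 3/16, and 10/16 as the $\psi_i$’s. The function name `select_by_probability` is hypothetical, and the three uniform numbers are fixed by hand so the lookup can be followed; in practice they would come from a random number generator.

```python
import bisect
import itertools


def select_by_probability(labels, psi, uniforms):
    """Map numbers in (0, 1) to labels using the cumulative psi ranges."""
    cumulative = list(itertools.accumulate(psi))  # 0.0625, 0.1875, 0.375, 1.0
    chosen = []
    for u in uniforms:
        # First cumulative value >= u; min() guards against round-off at the top end.
        index = min(bisect.bisect_left(cumulative, u), len(labels) - 1)
        chosen.append(labels[index])
    return chosen


stores = ["A", "B", "C", "D"]
psi = [1 / 16, 2 / 16, 3 / 16, 10 / 16]
print(select_by_probability(stores, psi, [0.05, 0.30, 0.90]))  # ['A', 'C', 'D']
```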

Systematic sampling is often used to select psus in large complex samples, rather 
than generating random numbers with replacement. Systematic sampling usually 
gives a sample without replacement, but in large populations sampling without replacement 
and sampling with replacement are very similar, as the probability that a unit 
will be selected twice is small. To sample psus systematically, list the population 
elements for the first psu in the sample, followed by the elements for the second 
psu, and so on. Then take a systematic sample of the elements. The psus to be 
included in the sample are those in which at least one element is in the system- 
atic sample of elements. The larger the psu, the higher the probability it will be in the 
sample. 


EXAMPLE 6.3 

The statistics classes have a total of 647 students. To take a (roughly, because 647 
is not a multiple of 5) systematic sample, choose a random number $k$ between 1 and 
129 and select the psu containing student $k$, the psu containing student $129 + k$, the 
psu containing student $2(129) + k$, and so on. Suppose the random number we select 
as a start value is 112. Then the systematic sample of elements results in the following 
psus being chosen: 


Number in 
Systematic Sample    psu Chosen 
112                   4 
241                   6 
370                   9 
499                  13 
628                  14 


Larger classes (psus) have a higher chance of being in the sample because it is 
more likely that a multiple of the random number chosen will be one of the numbered 
elements in a large psu. Systematic sampling does not give us a true random sample 
with replacement, though, because it is impossible for classes with 129 or fewer 
students to occur in the sample more than once, and classes with more than 129 
students are sampled with probability 1. In many populations, however, it is much 
easier to implement than methods that do give a random sample. If the psus are 
arranged geographically, taking a systematic sample may force the selected psus to 
be spread out over more of the region, and may give better results than a random 
sample with replacement. 
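
The systematic selection for the statistics classes can be sketched as follows. The function name `systematic_psus` is hypothetical; the class sizes are those of Table 6.1, and the interval of 129 and start value of 112 are the ones used above.

```python
import bisect
import itertools

# Class sizes M_i for classes 1..15, from Table 6.1.
CLASS_SIZES = [44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15]


def systematic_psus(sizes, interval, start):
    """Return (element number, 1-based psu label) for start, start + interval, ..."""
    upper = list(itertools.accumulate(sizes))       # cumulative M_i
    picks = range(start, upper[-1] + 1, interval)   # stop at the last element
    return [(k, bisect.bisect_left(upper, k) + 1) for k in picks]


print(systematic_psus(CLASS_SIZES, interval=129, start=112))
# [(112, 4), (241, 6), (370, 9), (499, 13), (628, 14)]
```

Because 647 is not a multiple of 129, a very small start value (1 or 2) would yield six elements rather than five, which is the sense in which the sample is only roughly systematic.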


6.2.1.2  Lahiri's Method 


Lahiri’s (1951) method may be more tractable than the cumulative-size method when 
the number of psus is large. It is an example of a rejective method, because you 
generate pairs of random numbers to select psus and then reject some of them if the 
psu size is too small. Let $N$ = number of psus in population and $\max\{M_i\}$ = maximum 
psu size. You will show that Lahiri’s method produces a with-replacement sample 
with the desired probabilities in Exercise 15. 


1 Draw a random number between 1 and N. This indicates which psu you are 
considering. 


2 Draw a random number between 1 and $\max\{M_i\}$. If this random number is less 
than or equal to $M_i$, then include psu $i$ in the sample; otherwise go back to step 1. 


3 Repeat until desired sample size is obtained. 


Let’s use Lahiri’s method for the classes in Example 6.2. The largest class has 
$\max\{M_i\} = 100$ students, so we generate pairs of random integers, the first between 
1 and 15, the second between 1 and 100, until the sample has five psus (Table 6.2). 
The psus to be sampled are {12, 14, 14, 5, 1}. ■ 

TABLE 6.2 
Lahiri’s Method, for Example 6.3 

First Random       Second Random 
Number (psu $i$)   Number           $M_i$    Action 
12                  6                24      6 < 24; include psu 12 in sample 
14                 24               100      Include in sample 
 1                 65                44      65 > 44; discard pair of numbers and try again 
 7                 84                20      84 > 20; try again 
10                 49                34      Try again 
14                 47               100      Include 
15                 43                15      Try again 
 5                 24                76      Include 
11                 87                46      Try again 
 1                 36                44      Include 
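
The three steps of Lahiri’s method can be sketched in Python as follows. The names `lahiri_sample` and `random_pairs` are hypothetical. The first call replays the ten pairs of Table 6.2, taking the psu in the eighth pair to be class 5 (the class with $M_i = 76$, consistent with the stated sample); the second call draws fresh pairs from a seeded generator.

```python
import random

# Class sizes M_i for classes 1..15, from Table 6.1.
CLASS_SIZES = [44, 33, 26, 22, 76, 63, 20, 44, 54, 34, 46, 24, 46, 100, 15]


def lahiri_sample(sizes, n, draw_pair):
    """Lahiri's rejective method.

    draw_pair() must return (psu label in 1..N, integer in 1..max(sizes)).
    """
    sample = []
    while len(sample) < n:
        psu, r = draw_pair()       # steps 1 and 2: candidate psu and second number
        if r <= sizes[psu - 1]:    # accept if the second number is at most M_i
            sample.append(psu)
        # otherwise discard the pair and draw again (step 3 repeats until n psus)
    return sample


def random_pairs(sizes, rng):
    """Return a function that draws one pair of random integers."""
    n_psus, max_size = len(sizes), max(sizes)
    return lambda: (rng.randint(1, n_psus), rng.randint(1, max_size))


table_pairs = iter([(12, 6), (14, 24), (1, 65), (7, 84), (10, 49),
                    (14, 47), (15, 43), (5, 24), (11, 87), (1, 36)])
print(lahiri_sample(CLASS_SIZES, 5, lambda: next(table_pairs)))  # [12, 14, 14, 5, 1]

# A fresh sample (result depends on the seed).
print(lahiri_sample(CLASS_SIZES, 5, random_pairs(CLASS_SIZES, random.Random(7))))
```

Unlike the cumulative-size method, this approach never needs the cumulative totals, only $N$ and the largest psu size.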


6.2.2 Theory of Estimation 


Because we are sampling with replacement, the sample may contain the same psu 
more than once. Let $\mathcal{R}$ denote the set of $n$ units in the sample, including the repeats. 
For Example 6.3, $\mathcal{R} = \{12, 14, 14, 5, 1\}$; unit 14 is included twice in $\mathcal{R}$. We saw in 
Section 6.1 that for a sample of size 1, $u_i = t_i/\psi_i$ is an unbiased estimator of the 
population total $t$. When we sample $n$ psus with replacement, we have $n$ independent 
estimators of $t$, so we average them: 

$$\hat{t}_\psi = \frac{1}{n} \sum_{i \in \mathcal{R}} \frac{t_i}{\psi_i} = \frac{1}{n} \sum_{i \in \mathcal{R}} u_i = \bar{u}. \tag{6.5}$$


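We estimate $V(\hat{t}_\psi)$ by 

$$\hat{V}(\hat{t}_\psi) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} (u_i - \bar{u})^2 = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} \left(\frac{t_i}{\psi_i} - \hat{t}_\psi\right)^2. \tag{6.6}$$

The estimator $\hat{t}_\psi$ in (6.5) is often referred to as the Hansen–Hurwitz (1943) estimator. 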

Equation (6.6) is the estimated variance of the average $\bar{u}$ from a simple random 
sample with replacement. Where are the unequal probabilities in the variance estimator? 
To prove that $\hat{t}_\psi$ and $\hat{V}(\hat{t}_\psi)$ are unbiased estimators of $t$ and $V(\hat{t}_\psi)$, respectively, 
we need random variables to keep track of which psus occur multiple times in the 
sample. Define 


$$Q_i = \text{number of times unit } i \text{ occurs in the sample};$$

$Q_i$ is a with-replacement analogue of the random variable $Z_i$ used to indicate sample 
inclusion for without-replacement sampling in (2.27). Then, $\hat{t}_\psi$ is the average of all 
$t_i/\psi_i$ for units chosen to be in the sample, including each unit as many times as it 
appears in the sample: 

$$\hat{t}_\psi = \frac{1}{n} \sum_{i \in \mathcal{R}} \frac{t_i}{\psi_i} = \frac{1}{n} \sum_{i=1}^{N} Q_i \frac{t_i}{\psi_i}. \tag{6.7}$$


If a unit appears $k$ times in the sample, it is counted $k$ times in the estimator. Note that 
$\sum_{i=1}^{N} Q_i = n$ and $E[Q_i] = n\psi_i$ (see Exercise 16), so $\hat{t}_\psi$ is an unbiased estimator of $t$. 
To calculate the variance, note that the estimator in (6.7) is the average of $n$ 
independent observations, each with variance $\sum_{i=1}^{N} \psi_i (t_i/\psi_i - t)^2$ [from (6.4)], so 

$$V(\hat{t}_\psi) = \frac{1}{n} \sum_{i=1}^{N} \psi_i \left(\frac{t_i}{\psi_i} - t\right)^2. \tag{6.8}$$


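To show that the variance estimator in (6.6) is unbiased for $V(\hat{t}_\psi)$, we write it in terms 
of the random variables $Q_i$: 

$$\hat{V}(\hat{t}_\psi) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} \left(\frac{t_i}{\psi_i} - \hat{t}_\psi\right)^2 = \frac{1}{n(n-1)} \sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - \hat{t}_\psi\right)^2. \tag{6.9}$$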


Equation (6.8) involves a weighted average of the $N$ values of $(t_i/\psi_i - t)^2$, weighted 
by the unequal selection probabilities $\psi_i$. In taking the sample, we have already used 
the unequal probabilities; they appear in the random variables $Q_i$ in (6.7). The $i$th 
psu appears $Q_i$ times in the with-replacement sample. Because the $n$ units are selected 
independently, $E[Q_i] = n\psi_i$, so including the squared deviation $(t_i/\psi_i - \hat{t}_\psi)^2$ a total 
of $Q_i$ times in the variance estimator causes (6.9) to be an unbiased estimator of the 
variance in (6.8): 


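$$\begin{aligned}
E[\hat{V}(\hat{t}_\psi)] &= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - \hat{t}_\psi\right)^2\right] \\
&= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - t + t - \hat{t}_\psi\right)^2\right] \\
&= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - t\right)^2 + 2(t - \hat{t}_\psi) \sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - t\right) + (t - \hat{t}_\psi)^2 \sum_{i=1}^{N} Q_i\right] \\
&= \frac{1}{n(n-1)} E\left[\sum_{i=1}^{N} Q_i \left(\frac{t_i}{\psi_i} - t\right)^2 - n(\hat{t}_\psi - t)^2\right] \\
&= \frac{1}{n(n-1)} \left[\sum_{i=1}^{N} n\psi_i \left(\frac{t_i}{\psi_i} - t\right)^2 - nV(\hat{t}_\psi)\right] \\
&= \frac{1}{n(n-1)} \left[n^2 V(\hat{t}_\psi) - nV(\hat{t}_\psi)\right] \\
&= V(\hat{t}_\psi).
\end{aligned}$$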


EXAMPLE 6.4 

In line 4 of the argument, we use the facts that $\sum_{i=1}^{N} Q_i = n$ and $\sum_{i=1}^{N} Q_i t_i/\psi_i = n\hat{t}_\psi$. 
In Exercise 7, you will show that the variance estimator for simple random sampling 
with replacement is a special case of (6.9). 
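
The unbiasedness of both estimators can be checked exactly on a small population by listing every ordered with-replacement sample. The sketch below does this for the four supermarkets of Section 6.1 with $n = 2$; the choice of population and sample size, and the function name `hansen_hurwitz`, are illustrative assumptions made here. Exact fractions are used so the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product

# Supermarket population: selection probabilities psi_i and totals t_i.
psi = [Fraction(1, 16), Fraction(2, 16), Fraction(3, 16), Fraction(10, 16)]
t = [11, 20, 24, 245]
n = 2


def hansen_hurwitz(draws):
    """Return (t_hat, v_hat) from (6.5) and (6.6) for a list of drawn psu indices."""
    u = [t[i] / psi[i] for i in draws]   # one estimate of the total per draw
    t_hat = sum(u) / len(u)
    v_hat = sum((ui - t_hat) ** 2 for ui in u) / (len(u) * (len(u) - 1))
    return t_hat, v_hat


exp_t = Fraction(0)
exp_v = Fraction(0)
for draws in product(range(len(t)), repeat=n):   # all 16 ordered samples
    prob = Fraction(1)
    for i in draws:
        prob *= psi[i]                           # draws are independent
    t_hat, v_hat = hansen_hurwitz(draws)
    exp_t += prob * t_hat
    exp_v += prob * v_hat

# True variance from (6.8).
true_var = sum(p * (ti / p - sum(t)) ** 2 for p, ti in zip(psi, t)) / n
print(exp_t, exp_v, true_var)  # 300 7124 7124
```

The expected value of the variance estimator equals the true variance, 14,248/2 = 7,124, even though samples such as (D, D) give an estimated variance of zero.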

Warning: If $N$ is small or some of the $\psi_i$’s are unusually large, it is possible 
that the sample will consist of one psu sampled n times. In that case, the estimated 
variance is zero; it is better to use sampling without replacement (see Section 6.4) if 
this may occur. 

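We estimate the population mean $\bar{y}_U$ by 

$$\hat{\bar{y}}_\psi = \frac{\hat{t}_\psi}{\hat{M}_{0\psi}}, \tag{6.10}$$

where 

$$\hat{M}_{0\psi} = \frac{1}{n} \sum_{i \in \mathcal{R}} \frac{M_i}{\psi_i} \tag{6.11}$$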


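estimates the total number of elements in the population. In (6.10), $\hat{\bar{y}}_\psi$ is a ratio; 
using results in Chapter 4, we calculate the residuals $t_i/\psi_i - \hat{\bar{y}}_\psi M_i/\psi_i$ to estimate the 
variance: 

$$\hat{V}(\hat{\bar{y}}_\psi) = \frac{1}{(\hat{M}_{0\psi})^2} \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} \left(\frac{t_i}{\psi_i} - \frac{\hat{\bar{y}}_\psi M_i}{\psi_i}\right)^2. \tag{6.12}$$

Note that (6.12) is an estimated variance of the same form as (6.6) with the values 
$(t_i - \hat{\bar{y}}_\psi M_i)/\hat{M}_{0\psi}$ substituted for $t_i$. 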


For the situation in Example 6.3, suppose we sample the psus selected by Lahiri’s 
method, {12, 14, 14, 5, 1}. The response $t_i$ is the total number of hours all students 
in class $i$ spent studying statistics last week, with the following data: 


Class    $\psi_i$     $t_i$    $t_i/\psi_i$ 
12        24/647       75      2021.875 
14       100/647      203      1313.410 
14       100/647      203      1313.410 
 5        76/647      191      1626.013 
 1        44/647      168      2470.364 


The numbers in the last column of the table are the estimates of $t$ that would be 
obtained if that psu were the only one selected in a sample of size 1. The population 
total is estimated by averaging the five values of $t_i/\psi_i$, using (6.5): 

$$\hat{t}_\psi = \frac{2021.875 + 1313.410 + 1313.410 + 1626.013 + 2470.364}{5} = 1749.014.$$


The standard error of $\hat{t}_\psi$ is simply $s/\sqrt{n}$ [see Equation (6.6)], where $s$ is the sample 
standard deviation of the five numbers in the rightmost column of the table: 

$$\text{SE}(\hat{t}_\psi) = \frac{1}{\sqrt{5}} \sqrt{\frac{(2021.875 - 1749.014)^2 + \cdots + (2470.364 - 1749.014)^2}{4}} = 222.42.$$
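
These two numbers can be reproduced with a short calculation. The sketch below is an illustration in Python rather than the SAS code mentioned in the text; the variable names are choices made here, and the data are the class sizes and study-hour totals from the table above.

```python
import math
import statistics

M0 = 647  # total number of students in the 15 classes
# (class, M_i, t_i) for each draw; class 14 was drawn twice, so it is listed twice.
draws = [(12, 24, 75), (14, 100, 203), (14, 100, 203), (5, 76, 191), (1, 44, 168)]

# One estimate of t per draw: u_i = t_i / psi_i, with psi_i = M_i / M0.
u = [t_i * M0 / M_i for _, M_i, t_i in draws]

t_hat = statistics.mean(u)                          # (6.5)
se_t_hat = statistics.stdev(u) / math.sqrt(len(u))  # square root of (6.6)

print(round(t_hat, 3), round(se_t_hat, 2))  # 1749.014 222.42
```

Note that `statistics.stdev` uses the divisor $n - 1$, matching the sample variance in (6.6).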


Since $\psi_i = M_i/M_0$ for this sample, we have $\hat{M}_{0\psi} = M_0 = 647$. The average amount 
of time a student spent studying statistics is estimated as 

$$\hat{\bar{y}}_\psi = \frac{1749.014}{647} = 2.70$$

hours. For this example, with $\psi_i = M_i/M_0$, (6.12) simplifies to 

$$\hat{V}(\hat{\bar{y}}_\psi) = \frac{1}{M_0^2} \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} \left(\frac{t_i}{\psi_i} - \frac{\hat{t}_\psi M_i}{M_0 \psi_i}\right)^2 = \frac{\hat{V}(\hat{t}_\psi)}{M_0^2},$$

so the standard error of $\hat{\bar{y}}_\psi$ is 222.42/647 = 0.34 hours. [In other examples, if $\hat{M}_{0\psi}$ can 
vary from sample to sample, this simplification of (6.12) does not occur.] SAS code 
for finding $\hat{t}_\psi$ and $\hat{\bar{y}}_\psi$ is provided on the website. ■ 

We would like to choose the $\psi_i$’s so that the variance of $\hat{t}_\psi$ is as small as possible. 
Ideally, we would use $\psi_i = t_i/t$ (then $\hat{t}_\psi = t$ for all samples and $V[\hat{t}_\psi] = 0$), so if $t_i$ is the 
annual income of the $i$th household, $\psi_i$ would be the proportion of total income in the 
population that came from the $i$th household. But of course, the $t_i$’s are unknown until 
sampled. Even if the income were known before the survey was taken, we are often 
interested in more than one quantity; using income for designing the probabilities of 
selection may not work well for estimating other quantities. 

Because many totals in a psu are related to the number of elements in a psu, we 
often take $\psi_i$ to be the proportion of elements in psu $i$ or the relative size of psu $i$. 
Then, a large psu has a greater chance of being in the sample than a small psu. With 
$M_i$ the number of elements in the $i$th psu and $M_0 = \sum_{i=1}^{N} M_i$ the number of elements 
in the population, we take $\psi_i = M_i/M_0$. With this choice of the probabilities $\psi_i$, we 
have probability proportional to size (pps) sampling. We used pps sampling in 
Example 6.2. 


Then for one-stage pps sampling, $t_i/\psi_i = t_i M_0/M_i = M_0 \bar{y}_i$, so 

$$\hat{t}_\psi = \frac{1}{n} \sum_{i \in \mathcal{R}} M_0 \bar{y}_i \quad \text{and} \quad \hat{\bar{y}}_\psi = \frac{1}{n} \sum_{i \in \mathcal{R}} \bar{y}_i.$$

With $\psi_i = M_i/M_0$, $\hat{\bar{y}}_\psi$ is the average of the sampled psu means. 
Also, for $\psi_i = M_i/M_0$, (6.11) implies that $\hat{M}_{0\psi} = M_0$ for every possible sample, so 
from (6.12), 

$$\hat{V}(\hat{\bar{y}}_\psi) = \frac{1}{n} \frac{1}{n-1} \sum_{i \in \mathcal{R}} (\bar{y}_i - \hat{\bar{y}}_\psi)^2.$$

Note that $\hat{V}(\hat{\bar{y}}_\psi)$ is of the form $s^2/n$, 
where $s^2$ is the sample variance of the psu means $\bar{y}_i$. 

FIGURE 6.1 

Selected plots for pps sample estimating the total number of physicians in the United States. (a) 
Plot of $t_i$ versus $\psi_i$; there is a strong linear relationship between the variables, which indicates 
that pps sampling increases efficiency. The unusual observation (marked by the ‘+’) is New 
York County, New York. (b) Histogram of the 100 values of $t_i/\psi_i$. Each value estimates $t$. 

EXAMPLE 6.5 

All of the work in pps sampling has been done in the sampling design itself. The 
pps estimates can be calculated simply by treating the $\bar{y}_i$’s as individual observations, 
and finding their mean and sample variance. In practice, however, there are usually 
some deviations from a strict pps scheme, so you should use (6.5) and (6.6) for 
estimating the population total and its estimated variance. 


The file statepop.dat contains data from an unequal-probability sample of 100 counties 
in the United States. Counties were chosen using the cumulative-size method from 
the listings in the County and City Data Book (U.S. Census Bureau, 1994) with 
probabilities proportional to their populations. The total population for all counties is 
$M_0 = \sum_{i=1}^{N} M_i = 255{,}077{,}536$. Sampling was done with replacement, so very large
counties occur multiple times in the sample: Los Angeles County, with the largest 
population in the United States, occurs four times. 

One of the quantities recorded for each county was the number of physicians in the 
county. You would expect larger counties to have more physicians, so pps sampling 
should work well for estimating the total number of physicians in the United States. 

You must be careful in plotting data from an unequal-probability sample, as you
need to consider the unequal probabilities when interpreting the plots. A plot of $t_i$
versus $\psi_i$ (Figure 6.1a) indicates the efficiency of the unequal-probability design: The
design works well when the plot shows positive correlation. A histogram of $t_i$ in a
pps sample will not give a representative view of the population of psus, as psus with
large $\psi_i$'s are overrepresented in the sample. A histogram of $t_i/\psi_i$, however, may give
an idea of the spread involved in the population estimates, and may help you identify
unusual psus (Figure 6.1b).

TABLE 6.3
Sampled Counties in Example 6.5

State  County         Population Size, $M_i$   $\psi_i$      Number of Physicians, $t_i$   $t_i/\psi_i$
AL     Wilcox                  13,672          0.00005360               4                   74,627.72
AZ     Maricopa             2,209,567          0.00866233            4320                  498,710.81
AZ     Maricopa             2,209,567          0.00866233            4320                  498,710.81
AZ     Pinal                  120,786          0.00047353              61                  128,820.64
AR     Garland                 76,100          0.00029834             131                  439,095.36
AR     Mississippi             55,060          0.00021586              48                  222,370.54
CA     Contra Costa           840,585          0.00329541            1761                  534,379.68
⋮
VA     Chesterfield           225,225          0.00088297             181                  204,990.72
WA     King                 1,557,537          0.00610613            5280                  864,704.59
WI     Lincoln                 27,822          0.00010907              28                  256,709.47
WI     Waukesha               320,306          0.00125572             687                  547,096.42
average                                                                                    570,304.30
std. dev.                                                                                  414,012.30


The sample was chosen using the cumulative-size method; Table 6.3 shows the 
sampled counties arranged alphabetically by state. The $\psi_i$'s were calculated using
$\psi_i = M_i/M_0$. The average of the $t_i/\psi_i$ column is 570,304.3, the estimated total number
of physicians in the United States. The standard error of the estimate is
$414{,}012.3/\sqrt{100} = 41{,}401$. For comparison, the County and City Data Book lists a
total of 532,638 physicians in the United States; a 95% CI using our estimate includes 
the true value. 
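
The arithmetic behind this standard error and confidence interval can be checked with a short sketch. The summary values are those reported for Table 6.3; the critical value 1.984 (the 0.975 quantile of a t distribution with 99 degrees of freedom) is an assumption supplied here, not a value given in the text.

```python
import math

# Summary values reported for Table 6.3 (100 sampled counties).
n = 100
mean_ratio = 570_304.3   # average of t_i / psi_i, the estimated total
sd_ratio = 414_012.3     # sample standard deviation of t_i / psi_i

se = sd_ratio / math.sqrt(n)

# Assumed critical value: 0.975 quantile of t with n - 1 = 99 df.
t_crit = 1.984
ci = (mean_ratio - t_crit * se, mean_ratio + t_crit * se)

true_total = 532_638     # County and City Data Book value quoted in the text
covers = ci[0] <= true_total <= ci[1]
print(round(se), [round(x) for x in ci], covers)
```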

These estimates can be found using the SAS code on the website. Partial output 
is given below: 


Data Summary 


Number of Observations 100 
Sum of Weights 2450.71956 
Statistics 
Std Error 
Variable N Mean of Mean 95% CL for Mean 
Statistics
Variable Sum Std Dev 95% CL for Sum 
physicns 570304 41401 488155.273 652453.317 


What if you do not know the value of $M_i$ for each psu in the population? In that
case, you may know the value of a quantity that is related to $M_i$. If sampling fish, you
may not know the number of fish in a haul but you may know the total weight of fish
in a haul. You can then use $x_i$ = (total weight of fish in haul i) to set the selection
probability for haul i as $\psi_i = x_i/t_x$. Since $x_i/t_x$ is not exactly the same as $M_i/M_0$, you
then must use (6.5) and (6.6) for estimating the population total of y and its standard
error.


6.2.4 Weights in Unequal-Probability Sampling with Replacement


As in other types of sampling, we can estimate the population total t using weights.
In without-replacement sampling, we use the reciprocal of the inclusion probability
($= 1/E[Z_i]$) as the weight for a unit; $E[Z_i]$ is the expected number of times unit i
appears in the sample (expected number of "hits"). In with-replacement sampling,
we use the first-stage weight

$$w_i = \frac{1}{\text{expected number of hits}} = \frac{1}{E[Q_i]} = \frac{1}{n\psi_i}.$$

With this choice of weight, we have, for $\hat{t}_\psi$ in Equation (6.5),

$$\hat{t}_\psi = \sum_{i \in R} w_i t_i.$$

In one-stage cluster sampling with replacement, we observe all of the $M_i$ ssus
every time psu i is selected, so we define

$$w_{ij} = w_i = \frac{1}{n\psi_i}.$$

Then, in terms of the elements,

$$\hat{t}_\psi = \sum_{i \in R} \sum_{j=1}^{M_i} w_{ij} y_{ij}$$

and

$$\hat{\bar{y}}_\psi = \frac{\sum_{i \in R} \sum_{j=1}^{M_i} w_{ij} y_{ij}}{\sum_{i \in R} \sum_{j=1}^{M_i} w_{ij}}.$$

If the selection probabilities $\psi_i$ are unequal, the sample is not self-weighting. In
one-stage pps sampling, elements in large psus have smaller weights than elements in
small psus.
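
As a small illustration of how the weights behave, the sketch below computes the first-stage weight $1/(n\psi_i)$ for two of the counties listed in Table 6.3, with $\psi_i = M_i/M_0$ and n = 100 draws as in Example 6.5. The function and variable names are illustrative only.

```python
# First-stage weights w_i = 1 / (n * psi_i) for one-stage pps sampling
# with replacement, using two counties from Table 6.3.
M0 = 255_077_536   # population total, from Example 6.5
n = 100            # number of psu draws

# County -> (population size M_i, number of physicians t_i).
counties = {"Wilcox, AL": (13_672, 4), "Maricopa, AZ": (2_209_567, 4320)}

def pps_weight(M_i, M0, n):
    """Weight 1/(n * psi_i) with psi_i = M_i / M0."""
    psi_i = M_i / M0
    return 1.0 / (n * psi_i)

weights = {name: pps_weight(M_i, M0, n) for name, (M_i, _) in counties.items()}

# Contribution of one draw of each county to t_hat_psi = sum of w_i * t_i.
contributions = {name: weights[name] * t_i for name, (_, t_i) in counties.items()}

for name in counties:
    print(name, round(weights[name], 3), round(contributions[name], 2))
```

The small county receives the larger weight, and each contribution equals the county's $t_i/\psi_i$ value divided by n.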


6.3 Two-Stage Sampling with Replacement


The estimators for two-stage unequal-probability sampling with replacement are
almost the same as those for one-stage sampling. Take a sample of psus with
replacement, choosing the ith psu with known probability $\psi_i$. As in one-stage sampling
with replacement, $Q_i$ is the number of times psu i occurs in the sample. Then take
a probability sample of $m_i$ subunits in the ith psu. Simple random sampling without
replacement or systematic sampling is often used to select the subsample, although
any probability sampling method may be used.

The only difference between two-stage sampling with replacement and one-stage
sampling with replacement is that in two-stage sampling, we must estimate $t_i$. If
psu i is in the sample more than once, there are $Q_i$ estimators of the total for psu i:
$\hat{t}_{i1}, \hat{t}_{i2}, \ldots, \hat{t}_{iQ_i}$.

The subsampling procedure needs to meet two requirements: 


1 Whenever psu i is selected to be in the sample, the same subsampling design 
is used to select secondary sampling units (ssus) from that psu. Different sub- 
samples from the same psu, though, must be sampled independently. Thus, if 
you decide before sampling that you will take an SRS of size 5 from psu 42 
if it is selected, every time psu 42 appears in the sample you must generate a 
different set of random numbers to select 5 of the ssus in psu 42. If you just 
take one subsample of size 5, and use it more than once for psu 42, you do not 
have independent subsamples and (6.14) will not be an unbiased estimator of the 
variance. 


2 The jth subsample taken from psu i (for $j = 1, \ldots, Q_i$) is selected in such a way
that $E[\hat{t}_{ij}] = t_i$. Because the same procedure is used each time psu i is selected for
the sample, we can define $V[\hat{t}_{ij}] = V_i$ for all j.


The estimators from one-stage unequal sampling with replacement are modified 
slightly to allow for different subsamples in psus that are selected more than once: 


$$\hat{t}_\psi = \frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{Q_i} \frac{\hat{t}_{ij}}{\psi_i} \tag{6.13}$$

$$\hat{V}(\hat{t}_\psi) = \frac{1}{n} \, \frac{1}{n-1} \sum_{i=1}^{N} \sum_{j=1}^{Q_i} \left( \frac{\hat{t}_{ij}}{\psi_i} - \hat{t}_\psi \right)^2 \tag{6.14}$$

In Exercise 16, you will show that (6.14) is an unbiased estimator of the variance
$V(\hat{t}_\psi)$, given in (6.46). Because sampling is with replacement, and hence it is possible
to have more than one subsample from a given psu, the variance estimator captures
both parts of the variance: the part due to the variability among psus, and the part that
arises because $t_i$ is estimated from a subsample rather than observed. When the psus
are sampled with replacement, and when an independent subsample is chosen each
time a psu is selected, the variance estimator can be calculated in the same way as if
the psu totals were measured rather than estimated.

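The estimator of the population mean $\bar{y}_U$ has a form similar to (6.10):

$$\hat{\bar{y}}_\psi = \frac{\hat{t}_\psi}{\hat{M}_{0\psi}},$$

where

$$\hat{M}_{0\psi} = \frac{1}{n} \sum_{i \in R} \frac{M_i}{\psi_i}$$

estimates the total number of elements in the population. The variance estimator again
uses the ratio results in (6.12):

$$\hat{V}(\hat{\bar{y}}_\psi) = \frac{1}{(\hat{M}_{0\psi})^2} \, \frac{1}{n} \, \frac{1}{n-1} \sum_{i=1}^{N} \sum_{j=1}^{Q_i} \left( \frac{\hat{t}_{ij}}{\psi_i} - \frac{\hat{\bar{y}}_\psi M_i}{\psi_i} \right)^2. \tag{6.15}$$
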

The weights for the observation units include a factor to reflect the subsampling
within each psu. If an SRS of size $m_i$ is taken in psu i, the weight for ssu j in psu i is

$$w_{ij} = \frac{1}{n\psi_i} \, \frac{M_i}{m_i}. \tag{6.16}$$

In a pps sample, in which the ith psu is selected with probability $\psi_i = M_i/M_0$, the
weight for ssu j of psu i is $w_{ij} = \frac{M_0}{n M_i} \, \frac{M_i}{m_i} = \frac{M_0}{n m_i}$; a pps sample is self-weighting if
all $m_i$'s are equal.
In summary, here are the steps for taking a two-stage unequal-probability sample 
with replacement: 


1 Determine the probabilities of selection $\psi_i$, the number n of psus to be sampled,
and the subsampling procedure to be used within each psu. With any method of
selecting the psus, we take a probability sample of ssus within the psus: often in
two-stage cluster sampling, we take an SRS (without replacement) of elements
within the chosen psus.


2 Select n psus with probabilities $\psi_i$ and with replacement. Either the cumulative-size
method or Lahiri's method may be used to select the psus for the sample.


3 Use the procedure determined in step 1 to select subsamples from the psus chosen.
If a psu occurs in the sample more than once, independent subsamples are used
for each replicate.


4 Estimate the population total t from each psu in the sample as though it were the
only one selected. The result is n estimates of the form $\hat{t}_i/\psi_i$.


5 $\hat{t}_\psi$ is the average of the n estimates in step 4. Alternatively, calculate
$\hat{t}_\psi = \sum_{i \in R} \sum_{j \in S_i} w_{ij} y_{ij}$.

6 $SE(\hat{t}_\psi) = (1/\sqrt{n})$ (sample standard deviation of the n estimates in step 4).
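
The six steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is hypothetical and fully enumerated, pps probabilities $\psi_i = M_i/M_0$ are used, an SRS of the same size m is taken on every draw, and `random.choices` stands in for the cumulative-size method. The function name `two_stage_wr_estimate` is made up for this sketch.

```python
import math
import random
import statistics

def two_stage_wr_estimate(psus, n, m, rng):
    """Two-stage pps sample with replacement; returns (t_hat, se).

    psus: dict mapping a psu label to the list of y-values of its ssus
          (a hypothetical, fully enumerated population).
    n: number of psu draws; m: size of the SRS taken on every draw.
    """
    labels = list(psus)
    sizes = [len(psus[k]) for k in labels]
    M0 = sum(sizes)

    # Step 2: n draws with replacement, probability proportional to size.
    draws = rng.choices(labels, weights=sizes, k=n)

    ratios = []
    for k in draws:
        M_i = len(psus[k])
        psi_i = M_i / M0                             # step 1: pps probability
        # Step 3: a fresh, independent SRS on every draw, even for repeats.
        subsample = rng.sample(psus[k], m)
        t_hat_i = M_i * statistics.mean(subsample)   # estimated psu total
        ratios.append(t_hat_i / psi_i)               # step 4

    t_hat = statistics.mean(ratios)                  # step 5
    se = statistics.stdev(ratios) / math.sqrt(n)     # step 6
    return t_hat, se

# Hypothetical population: four psus with made-up y-values.
rng = random.Random(1)
population = {label: [rng.uniform(0, 6) for _ in range(size)]
              for label, size in {"a": 24, "b": 100, "c": 76, "d": 44}.items()}
t_hat, se = two_stage_wr_estimate(population, n=5, m=5, rng=rng)
```

Because each draw produces its own subsample, a psu that is drawn twice contributes two separate values to `ratios`, which is what allows the standard deviation in step 6 to capture both stages of variability.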
TABLE 6.4
Spreadsheet for Calculations in Example 6.6

Class   $M_i$   $\psi_i$   $y_{ij}$             $\bar{y}_i$   $\hat{t}_i$   $\hat{t}_i/\psi_i$
12       24     0.0371     2, 3, 2.5, 3, 1.5    2.4            57.6         1552.8
14      100     0.1546     2.5, 2, 3, 0, 0.5    1.6           160.0         1035.2
14      100     0.1546     3, 0.5, 1.5, 2, 3    2.0           200.0         1294.0
5        76     0.1175     1, 2.5, 3, 5, 2.5    2.8           212.8         1811.6
1        44     0.0680     4, 4.5, 3, 2, 5      3.7           162.8         2393.9
average                                                                      1617.5
std. dev.                                                                    521.628


EXAMPLE 6.6
Let's return to the situation in Example 6.4. Now suppose we subsample five students
in each class rather than observing $t_i$. The estimation process is almost the same as
in Example 6.4. The response $y_{ij}$ is the total number of hours student j in class i
spent studying statistics last week (Table 6.4). Note that class 14 appears twice in the
sample; each time it appears, a different subsample is collected.

Thus, $\hat{t}_\psi = 1617.5$ and $SE(\hat{t}_\psi) = 521.628/\sqrt{5} = 233.28$. From this sample, the
average amount of time a student spent studying statistics is

$$\hat{\bar{y}}_\psi = \frac{1617.5}{647} = 2.5$$

hours with standard error 233.28/647 = 0.36 hour.
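
These calculations can be reproduced with the following sketch, which uses the class sizes and subsampled hours from Table 6.4 and the population size $M_0 = 647$. It also computes the total from element weights $M_0/(n m_i)$, the pps form of (6.16). Variable names are illustrative.

```python
import math
import statistics

M0 = 647   # total number of students in the population
# One entry per draw: (class, M_i, subsampled hours y_ij), from Table 6.4.
draws = [
    (12, 24, [2, 3, 2.5, 3, 1.5]),
    (14, 100, [2.5, 2, 3, 0, 0.5]),
    (14, 100, [3, 0.5, 1.5, 2, 3]),
    (5, 76, [1, 2.5, 3, 5, 2.5]),
    (1, 44, [4, 4.5, 3, 2, 5]),
]
n = len(draws)

ratios = []
for _, M_i, y in draws:
    psi_i = M_i / M0                       # pps selection probability
    t_hat_i = M_i * statistics.mean(y)     # estimated class total
    ratios.append(t_hat_i / psi_i)         # estimate of t from this draw

t_hat = statistics.mean(ratios)
se_t = statistics.stdev(ratios) / math.sqrt(n)
ybar_hat = t_hat / M0
se_ybar = se_t / M0

# Same total from element weights w_ij = M0 / (n * m_i).
t_hat_weighted = sum(M0 / (n * len(y)) * sum(y) for _, _, y in draws)

print(round(t_hat, 1), round(se_t, 2), round(ybar_hat, 2), round(se_ybar, 2))
```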
Here is SAS output finding these estimates (the SAS code is on the website). Note
that the sum of the weights is 647, which is the number of students (the value is exact
here, since pps sampling was used; in general, the sum of the weights will estimate
the number of elements).


Data Summary 


Number of Clusters 5 
Number of Observations 25 
Sum of Weights 647 
Statistics 
Std Error 
Variable N Mean of Mean 95% CL for Mean 
hours 25 2.500000 0.360555 1.49893848 3.50106152 
Statistics 
Variable Sum Std Dev 95% CL for Sum 


Classes were selected with probability proportional to number of students in the
class, so $\psi_i = M_i/M_0$. Subsampling the same number of students in each class results
in a self-weighting sample, with each student having weight

$$w_{ij} = \frac{M_0}{n M_i} \, \frac{M_i}{m_i} = \frac{647}{(5)(5)} = 25.88.$$

The population total is equivalently estimated as

$$25.88 \, (2 + 3 + 2.5 + \cdots + 3 + 2 + 5) = 1617.5.$$


EXAMPLE 6.7
Let's see what happens if we use unequal-probability sampling on the puppy homes
considered in Example 5.8. Take $\psi_i$ proportional to the number of puppies in the home,
so that Puppy Palace with 30 puppies is sampled with probability 3/4 and Dog's Life
with 10 puppies is sampled with probability 1/4. As before, once a puppy home is
chosen, take an SRS of two puppies in the home. Then if Puppy Palace is selected,
$\hat{t}_\psi = \hat{t}_{PP}/(3/4) = (30)(4)/(3/4) = 160$. If Dog's Life is chosen, $\hat{t}_\psi = \hat{t}_{DL}/(1/4) =
(10)(4)/(1/4) = 160$. Thus, either possible sample results in an estimated average of
$\hat{\bar{y}}_\psi = 160/40 = 4$ legs per puppy, and the variance of the estimator is zero.
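
The zero variance in the puppy example can be confirmed by enumerating the two possible first-stage draws. The sketch below uses exact fractions; the dictionary layout and names are illustrative.

```python
from fractions import Fraction

# Puppy homes from the example: name -> (number of puppies, legs per puppy).
homes = {"Puppy Palace": (30, 4), "Dog's Life": (10, 4)}
M0 = sum(size for size, _ in homes.values())

estimates = {}
for name, (size, legs) in homes.items():
    psi = Fraction(size, M0)          # selection probability proportional to size
    t_hat_i = size * legs             # every SRS of two puppies has mean 4 legs
    estimates[name] = t_hat_i / psi   # estimate of the total if this home is drawn

# Expected value and variance of the estimator over the two possible draws.
expected = sum(Fraction(size, M0) * estimates[name]
               for name, (size, _) in homes.items())
variance = sum(Fraction(size, M0) * (estimates[name] - expected) ** 2
               for name, (size, _) in homes.items())
print(estimates, expected, variance)
```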


Sampling with replacement has the advantage that it is very easy to select the 
sample and to obtain estimates of the population total and its variance. If N is small, 
however, as occurs in many highly stratified complex surveys with few clusters in 
each stratum, sampling with replacement may be less efficient than sampling with- 
out replacement. The next section discusses advantages and challenges of sampling 
without replacement. 


6.4 Unequal-Probability Sampling Without Replacement


Generally, sampling with replacement is less efficient than sampling without
replacement; with-replacement sampling is introduced first because of the ease in selecting
and analyzing samples. Nevertheless, in large surveys with many small strata, the
inefficiencies may wipe out the gains in convenience. Much research has been done
on unequal-probability sampling without replacement; the theory is more complicated
because the probability that a unit is selected is different for the first unit chosen than
for the second, third, and subsequent units. When you understand the probabilistic
arguments involved, however, you can find the properties of any sampling scheme.


EXAMPLE 6.8
The supermarket example from Section 6.1 can be used to illustrate some of the
features of unequal-probability sampling without replacement. Here is the population
again:
Store   Size ($m^2$)   $\psi_i$   $t_i$ (in Thousands)
A        100           1/16        11
B        200           2/16        20
C        300           3/16        24
D       1000           10/16      245
Total   1600           1          300


Let's select two psus without replacement and with unequal probabilities. As in
Sections 6.1 to 6.3, let

$$\psi_i = P(\text{Select unit } i \text{ on first draw}).$$

Since we are sampling without replacement, though, the probability that unit j is
selected on the second draw depends on which unit was selected on the first draw.

One way to select the units with unequal probabilities is to use $\psi_i$ as the probability
of selecting unit i on the first draw, and then adjust the probabilities of selecting the
other stores on the second draw. If store A was chosen on the first draw, then for
selecting the second store we would spin the wheel on page 222 while blocking out
the section for store A, or shuffle the deck and redeal without Card 1. Thus,

$$P(\text{store A chosen on first draw}) = \psi_A = \frac{1}{16}$$

and

$$P(\text{B chosen on second draw} \mid \text{A chosen on first draw}) = \frac{2/16}{1 - 1/16} = \frac{\psi_B}{1 - \psi_A}.$$

The denominator is the sum of the $\psi_i$'s for stores B, C, and D. In general,

$$P(\text{unit } i \text{ chosen first, unit } k \text{ chosen second})
= P(\text{unit } i \text{ chosen first}) \, P(\text{unit } k \text{ chosen second} \mid \text{unit } i \text{ chosen first})
= \psi_i \frac{\psi_k}{1 - \psi_i}.$$

Similarly,

$$P(\text{unit } k \text{ chosen first, unit } i \text{ chosen second}) = \psi_k \frac{\psi_i}{1 - \psi_k}.$$

Note that P(unit i chosen first, unit k chosen second) is not the same as P(unit k chosen
first, unit i chosen second): The order of selection makes a difference! By adding the
probabilities of the two choices, though, we can find the probability that a sample of
size 2 consists of psus i and k:

$$\text{For } n = 2, \quad P(\text{units } i \text{ and } k \text{ in sample}) = \pi_{ik} = \psi_i \frac{\psi_k}{1 - \psi_i} + \psi_k \frac{\psi_i}{1 - \psi_k}.$$

The probability that psu i is in the sample is then

$$\pi_i = \sum_{S:\, i \in S} P(S).$$

Table 6.5 gives the $\pi_i$'s and $\pi_{ik}$'s for the supermarkets.

TABLE 6.5

Inclusion probabilities ($\pi_i$) and joint inclusion probabilities ($\pi_{ik}$) for samples of size 2 that
could be selected using the method in Example 6.8. The entries of the table are the $\pi_{ik}$'s for
each pair of stores (rounded to four decimal places); the margins give the $\pi_i$'s for the four
stores

                          Store k
Store i     A         B         C         D         $\pi_i$
A           -         0.0173    0.0269    0.1458    0.1900
B           0.0173    -         0.0556    0.2976    0.3705
C           0.0269    0.0556    -         0.4567    0.5393
D           0.1458    0.2976    0.4567    -         0.9002
$\pi_k$     0.1900    0.3705    0.5393    0.9002    2.0000
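
The entries of Table 6.5 follow directly from the draw-by-draw probabilities. The sketch below recomputes them with exact fractions for the four stores; it assumes n = 2, and the helper name `joint_inclusion` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Draw-by-draw selection probabilities psi_i for the four stores.
psi = {"A": Fraction(1, 16), "B": Fraction(2, 16),
       "C": Fraction(3, 16), "D": Fraction(10, 16)}

def joint_inclusion(i, k):
    """P(units i and k both in a sample of size 2), adding both draw orders."""
    return psi[i] * psi[k] / (1 - psi[i]) + psi[k] * psi[i] / (1 - psi[k])

pi_joint = {frozenset(pair): joint_inclusion(*pair)
            for pair in combinations(psi, 2)}

# pi_i is the sum of P(S) over the samples S that contain unit i.
pi = {i: sum(p for pair, p in pi_joint.items() if i in pair) for i in psi}

for i in psi:
    print(i, round(float(pi[i]), 4))
print("sum of pi_i:", sum(pi.values()))
```

Using fractions makes it easy to confirm that the inclusion probabilities sum to exactly n = 2.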


6.4.1 The Horvitz-Thompson Estimator for One-Stage Sampling


Assume we have a without-replacement sample of n psus, and we know the inclusion
probability

$$\pi_i = P(\text{unit } i \text{ in sample})$$

and the joint inclusion probability

$$\pi_{ik} = P(\text{units } i \text{ and } k \text{ are both in the sample}).$$

The inclusion probability $\pi_i$ can be calculated as the sum of the probabilities of all
samples containing the ith unit and has the property that

$$\sum_{i=1}^{N} \pi_i = n. \tag{6.17}$$

For the $\pi_{ik}$'s, as shown in Theorem 6.1 of Section 6.6,

$$\sum_{\substack{k=1 \\ k \ne i}}^{N} \pi_{ik} = (n-1)\pi_i. \tag{6.18}$$
Because the inclusion probabilities sum to n, we can think of $\pi_i/n$ as the "average
probability" that a unit will be selected on one of the draws. Recall that for one-stage
sampling with replacement, $\hat{t}_\psi$ is the average of the values of $t_i/\psi_i$ for psus in
the sample. But when samples are drawn without replacement, the probabilities of
selection depend on what was drawn before. Instead of dividing the total $t_i$ from psu i
by $\psi_i$, we divide by the average probability of selecting that unit in a draw, $\pi_i/n$. We
then have the Horvitz-Thompson (HT) estimator of the population total (Horvitz
and Thompson, 1952):

$$\hat{t}_{HT} = \sum_{i \in S} \frac{t_i}{\pi_i} = \sum_{i=1}^{N} Z_i \frac{t_i}{\pi_i}, \tag{6.19}$$

where $Z_i = 1$ if psu i is in the sample, and 0 otherwise.
The Horvitz-Thompson estimator is shown to be unbiased for t by using
Theorem 6.2 in Section 6.6. Here, $P(Z_i = 1) = \pi_i$, so by (6.38),

$$E[\hat{t}_{HT}] = \sum_{i=1}^{N} \pi_i \frac{t_i}{\pi_i} = t.$$

We shall show in Section 6.6, using Equations (6.39) through (6.41), that the variance
of the Horvitz-Thompson estimator in one-stage sampling is

$$V(\hat{t}_{HT}) = \sum_{i=1}^{N} \frac{1 - \pi_i}{\pi_i} t_i^2 + \sum_{i=1}^{N} \sum_{k \ne i} \frac{\pi_{ik} - \pi_i \pi_k}{\pi_i \pi_k} t_i t_k \tag{6.20}$$

$$\phantom{V(\hat{t}_{HT})} = \frac{1}{2} \sum_{i=1}^{N} \sum_{k \ne i} (\pi_i \pi_k - \pi_{ik}) \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2. \tag{6.21}$$

The expression in (6.21) is the Sen-Yates-Grundy (SYG) form of the variance (Sen,
1953; Yates and Grundy, 1953). You can see from (6.21) that the variance of the
Horvitz-Thompson estimator is 0 if $t_i$ is proportional to $\pi_i$.

The expressions for the variance in (6.20) and (6.21) are algebraically identical
(this is shown in Theorem 6.2 of Section 6.6). When the inclusion probabilities $\pi_i$ or the
joint inclusion probabilities $\pi_{ik}$ are unequal, however, substituting sample quantities
into (6.20) or (6.21) leads to different estimators of the variance.

The estimator of the variance starting from (6.20), suggested by Horvitz and
Thompson (1952), is

$$\hat{V}_{HT}(\hat{t}_{HT}) = \sum_{i \in S} (1 - \pi_i) \frac{t_i^2}{\pi_i^2} + \sum_{i \in S} \sum_{\substack{k \in S \\ k \ne i}} \frac{\pi_{ik} - \pi_i \pi_k}{\pi_{ik}} \, \frac{t_i}{\pi_i} \, \frac{t_k}{\pi_k}. \tag{6.22}$$

The SYG estimator, working from (6.21), is

$$\hat{V}_{SYG}(\hat{t}_{HT}) = \frac{1}{2} \sum_{i \in S} \sum_{\substack{k \in S \\ k \ne i}} \frac{\pi_i \pi_k - \pi_{ik}}{\pi_{ik}} \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2. \tag{6.23}$$

Theorem 6.4 in Section 6.6 shows that (6.22) and (6.23) are both unbiased estimators
of the variance in (6.21). Both require $\pi_{ik} > 0$ for all units in the sample. The SYG
form in (6.23) is generally the more stable of the two variance estimators.


EXAMPLE 6.9
Let's look at the Horvitz-Thompson estimator for a sample of 2 supermarkets in
Example 6.8 with joint inclusion probabilities given in Table 6.5. We use the draw-by-draw
method to select the sample. To select the first psu, we generate a random integer
from {1,..., 16}: the random integer we generate is 12, which tells us that store D
is selected on the first draw. We then remove the values {7,..., 16} corresponding to
store D, and generate a second random integer from {1,..., 6}; we generate 6, which
tells us to select store C on the second draw. The Horvitz-Thompson estimate of the
total sales for sample {C, D} is then

$$\hat{t}_{HT} = \sum_{i \in S} \frac{t_i}{\pi_i} = \frac{245}{0.9002} + \frac{24}{0.5393} = 316.67.$$

Since for this example we know the entire population, we can calculate the theoretical
variance of $\hat{t}_{HT}$ using (6.21):

$$V(\hat{t}_{HT}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{k \ne i} (\pi_i \pi_k - \pi_{ik}) \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2 = 4383.6.$$

[We obtain the same value, 4383.6, if we use the equivalent formulation in (6.20).]
We have two estimates of the variance from sample {C, D}: from (6.22),

$$\hat{V}_{HT}(\hat{t}_{HT}) = \frac{(1 - 0.9002)(245)^2}{(0.9002)^2} + \frac{(1 - 0.5393)(24)^2}{(0.5393)^2}
+ 2 \, \frac{0.4567 - (0.9002)(0.5393)}{0.4567} \left( \frac{245}{0.9002} \right) \left( \frac{24}{0.5393} \right) = 6782.8.$$

The SYG estimate, from (6.23), is

$$\hat{V}_{SYG}(\hat{t}_{HT}) = \frac{(0.9002)(0.5393) - 0.4567}{0.4567} \left( \frac{245}{0.9002} - \frac{24}{0.5393} \right)^2 = 3259.8.$$

Because all values in this population are known, we can examine the estimators
for all possible samples selected according to the probabilities in Table 6.5. Results
are given in Table 6.6. For three of the possible samples, $\hat{V}_{HT}(\hat{t}_{HT})$ is negative! This
is true even though $\hat{V}_{HT}(\hat{t}_{HT})$ and $\hat{V}_{SYG}(\hat{t}_{HT})$ are unbiased estimators of $V(\hat{t}_{HT})$; it is
easy to check for this example that

$$\sum_{\text{possible samples } S} P(S) \, \hat{V}_{HT}(\hat{t}_{HT,S}) = \sum_{\text{possible samples } S} P(S) \, \hat{V}_{SYG}(\hat{t}_{HT,S}) = 4383.6.$$

Example 6.9 demonstrates a problem that can arise in estimating the variance
of $\hat{t}_{HT}$: The unbiased estimators in (6.22) or (6.23) can take on negative values in
some unequal-probability designs! [See Exercise 24 for a situation in which (6.23) is
negative.] In some designs, the estimates of the variance can be widely disparate for
different samples. The stability can sometimes be improved by careful choice of the
sampling design, but in general, the calculations are cumbersome.

TABLE 6.6
Variance estimates for all possible without-replacement samples of size 2, for the
supermarket example

Sample, S   P(S)      $\hat{t}_{HT}$   $\hat{V}_{HT}(\hat{t}_{HT})$   $\hat{V}_{SYG}(\hat{t}_{HT})$
{A, B}      0.01726   111.87           -14,691.5                          47.1
{A, C}      0.02692   102.39           -10,832.1                         502.8
{A, D}      0.14583   330.06             4,659.3                       7,939.8
{B, C}      0.05563    98.48            -9,705.1                         232.7
{B, D}      0.29762   326.15             5,682.8                       5,744.1
{C, D}      0.45673   316.67             6,782.8                       3,259.8

In addition, the estimators in (6.22) and (6.23) can be difficult to use in practice
because they require knowledge of the joint inclusion probabilities $\pi_{ik}$ (see Särndal,
1996). Since $\pi_{ik}$ appears in the denominator, the joint inclusion probability $\pi_{ik}$ must
be strictly positive for every pair of psus. Public use data sets from large-scale surveys
commonly include a variable of weights that can be used to calculate the
Horvitz-Thompson estimator. But it is generally impractical to provide the joint inclusion
probabilities $\pi_{ik}$: this would require an additional n(n - 1)/2 values to be included
in the data set, where n is often large. In addition, for many surveys it is challenging
to calculate the joint inclusion probabilities $\pi_{ik}$.

An alternative suggested by Durbin (1953), which avoids some of the potential
instability and computational complexity, is to pretend the units were selected with
replacement and use the with-replacement variance estimator in (6.9) rather than
(6.22) or (6.23). The with-replacement variance estimator, setting $\psi_i = \pi_i/n$, is

$$\hat{V}_{WR}(\hat{t}_{HT}) = \frac{1}{n} \, \frac{1}{n-1} \sum_{i \in S} \left( \frac{t_i}{\psi_i} - \hat{t}_{HT} \right)^2 = \frac{n}{n-1} \sum_{i \in S} \left( \frac{t_i}{\pi_i} - \frac{\hat{t}_{HT}}{n} \right)^2. \tag{6.24}$$


The variance estimator in (6.24) is always nonnegative, so you can avoid the potential
embarrassment of trying to explain a negative variance estimate. In addition, the
with-replacement variance estimator does not require knowledge of the joint inclusion
probabilities $\pi_{ik}$. If without-replacement sampling is more efficient than with-replacement
sampling, the with-replacement variance estimator in (6.24) is expected to overestimate
the variance and result in conservative confidence intervals (CIs), but the bias
is expected to be small if the sampling fraction n/N is small. The commonly used
computer-intensive methods described in Chapter 9 calculate the with-replacement
variance.
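
A minimal sketch of (6.24) follows, applied to the supermarket sample {C, D} with the rounded inclusion probabilities from Table 6.5. The function name `v_wr` is illustrative. In that population n/N = 2/4 is large, so for this one sample the with-replacement estimate comes out far above the theoretical variance of 4383.6 reported in Example 6.9.

```python
def v_wr(t_over_pi):
    """With-replacement variance estimate (6.24) from the values t_i / pi_i."""
    n = len(t_over_pi)
    t_ht = sum(t_over_pi)                   # Horvitz-Thompson estimate
    return n / (n - 1) * sum((x - t_ht / n) ** 2 for x in t_over_pi)

# Sample {C, D}: t_i / pi_i with the rounded pi_i values of Table 6.5.
sample = [24 / 0.5393, 245 / 0.9002]
estimate = v_wr(sample)
print(round(estimate, 1))
```

Only the values $t_i/\pi_i$ enter the calculation; no joint inclusion probabilities are needed, and the result is a sum of squares, so it cannot be negative.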

In general, we recommend using the with-replacement variance estimator in
(6.24). When the sampling fraction n/N is large, however, this can overestimate the
variance. Some survey software packages will calculate the SYG variance estimate
if the user provides the $\pi_{ik}$'s. Berger (2004) and Brewer and Donadio (2003) suggest
alternatives for estimating $V(\hat{t}_{HT})$ when the joint inclusion probabilities are unknown.
These methods are presented in Exercises 29 and 30.


EXAMPLE 6.10
Let's select an unequal-probability sample without replacement of size 15 from the
file agpop.dat. In Example 4.2, we used the variable acres87 as auxiliary information
in ratio estimation. We now use it in the sample design, selecting counties with
probability proportional to acres87. The SAS code used to select and analyze this
sample is given on the website. The data for the sample, along with the joint inclusion
probabilities, are in file agpps.dat.

The Horvitz-Thompson estimate of the total for acres92 is

$$\hat{t}_{HT} = \sum_{i \in S} \frac{t_i}{\pi_i} = 992{,}665{,}083,$$

where $t_i$ is the value of acres92 for county i in the sample. The three variance estimates
are: $\hat{V}_{HT}(\hat{t}_{HT}) = 5.31 \times 10^{14}$, $\hat{V}_{SYG}(\hat{t}_{HT}) = 1.22 \times 10^{14}$, and $\hat{V}_{WR}(\hat{t}_{HT}) = 1.33 \times 10^{14}$.
Because of the instability of $\hat{V}_{HT}(\hat{t}_{HT})$, we prefer to use either $\hat{V}_{SYG}(\hat{t}_{HT})$ or $\hat{V}_{WR}(\hat{t}_{HT})$
to estimate the variance. For this sample, $\hat{V}_{WR}(\hat{t}_{HT})$ is quite close to the SYG estimate
because the sampling fraction n/N is small. Using the SAS code on the website,
we obtain $SE(\hat{t}_{HT}) = 11{,}543{,}326 = \sqrt{1.33 \times 10^{14}}$, which is the square root of the
with-replacement variance estimate.

Note the gain in efficiency from using unequal-probability sampling. From
Example 2.6, an SRS of size 300 gave a standard error of 58,169,381 for the estimated total
of acres92. The unequal-probability sample has a smaller standard error even though
the sample size is only 15 because of the high correlation between acres92 and
acres87. Using the auxiliary information in the variable acres87 in the design results
in a large gain in efficiency.


6.4.2 Selecting the psus


For the supermarkets in Example 6.8, the draw-by-draw selection probabilities $\psi_i$
are proportional to the store sizes. The inclusion probabilities $\pi_i$, however, are not
proportional to the sizes of the stores; in fact, they cannot be proportional to the
store sizes, because Store D accounts for more than half of the total floor area but
cannot be sampled with a probability greater than one (with n = 2, proportionality
would require $\pi_D = 2(10/16) = 1.25$). The $\pi_i$'s that result from this
draw-by-draw method due to Yates and Grundy (1953) may or may not be the desired
probabilities of inclusion in the sample; you may need to adjust the $\psi_i$'s to obtain a
pre-specified set of $\pi_i$'s. Such adjustments become difficult for large populations and
for sample sizes larger than two.

Many methods have been proposed for selecting psus without replacement so
that desired inclusion probabilities are attained. Systematic sampling can be used
to draw a sample without replacement and is relatively simple to implement (hence
its widespread use), but many of the $\pi_{ik}$'s for the population are zero. If psus are
selected using systematic sampling, you need to use the with-replacement estimator
of variance in (6.24), since the without-replacement variance estimators in (6.22) and
(6.23) contain $\pi_{ik}$ in the denominator and hence are undefined. Brewer and Hanif
(1983) present more than 50 methods for selecting without-replacement
unequal-probability samples. Most of these methods are for n = 2; three of the methods are
described in Exercises 25, 27, and 28. Some methods are easier to compute, some
are more suitable for specific applications, and some result in a more stable estimator
of $V(\hat{t}_{HT})$. Tillé (2006) gives general algorithms for selecting without-replacement
unequal-probability samples.

SAS software (PROC SURVEYSELECT) will select samples with unequal
probabilities. The website has examples of SAS programs (example0602.sas and
ppsselect.sas) that can be used to select without-replacement unequal-probability
samples. In Example 6.10, we used a method developed by Hanurav (1967) and
Vijayan (1968) to select the sample.


6.4.3 The Horvitz-Thompson Estimator for Two-Stage Sampling


The Horvitz-Thompson estimator for two-stage sampling is similar to the estimator
for one-stage sampling in (6.19): We substitute an unbiased estimator $\hat{t}_i$ of the psu
total for the unknown value of $t_i$, obtaining

$$\hat{t}_{HT} = \sum_{i \in S} \frac{\hat{t}_i}{\pi_i} = \sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i}, \tag{6.25}$$

where $Z_i = 1$ if psu i is in the sample, and 0 otherwise.

The two-stage Horvitz-Thompson estimator is an unbiased estimator of t as long
as $E[\hat{t}_i] = t_i$ for each psu i (see Theorem 6.2 in Section 6.6). We shall show in
Section 6.6, using Equations (6.39) through (6.41), that the variance of the
Horvitz-Thompson estimator is

$$V(\hat{t}_{HT}) = \sum_{i=1}^{N} \frac{1 - \pi_i}{\pi_i} t_i^2 + \sum_{i=1}^{N} \sum_{k \ne i} \frac{\pi_{ik} - \pi_i \pi_k}{\pi_i \pi_k} t_i t_k + \sum_{i=1}^{N} \frac{V(\hat{t}_i)}{\pi_i} \tag{6.26}$$

$$\phantom{V(\hat{t}_{HT})} = \frac{1}{2} \sum_{i=1}^{N} \sum_{k \ne i} (\pi_i \pi_k - \pi_{ik}) \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2 + \sum_{i=1}^{N} \frac{V(\hat{t}_i)}{\pi_i}. \tag{6.27}$$

The expression in (6.27) is again the SYG form. The first part of the variance is the
same as for one-stage sampling [see (6.20) and (6.21)]. The last term is the additional
variability due to estimating the $t_i$'s rather than measuring them exactly.

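The Horvitz-Thompson estimator of the variance in two-stage cluster sampling is

$$\hat{V}_{HT}(\hat{t}_{HT}) = \sum_{i \in S} (1 - \pi_i) \frac{\hat{t}_i^2}{\pi_i^2} + \sum_{i \in S} \sum_{\substack{k \in S \\ k \ne i}} \frac{\pi_{ik} - \pi_i \pi_k}{\pi_{ik}} \, \frac{\hat{t}_i}{\pi_i} \, \frac{\hat{t}_k}{\pi_k} + \sum_{i \in S} \frac{\hat{V}(\hat{t}_i)}{\pi_i}, \tag{6.28}$$

and the SYG estimator is

$$\hat{V}_{SYG}(\hat{t}_{HT}) = \frac{1}{2} \sum_{i \in S} \sum_{\substack{k \in S \\ k \ne i}} \frac{\pi_i \pi_k - \pi_{ik}}{\pi_{ik}} \left( \frac{\hat{t}_i}{\pi_i} - \frac{\hat{t}_k}{\pi_k} \right)^2 + \sum_{i \in S} \frac{\hat{V}(\hat{t}_i)}{\pi_i}. \tag{6.29}$$

Theorem 6.4 in Section 6.6 shows that both are unbiased estimators of the variance
in (6.27); however, just as in one-stage sampling, either can be negative in practice.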

For most situations, we recommend using the with-replacement sampling variance
estimator:

$$\hat{V}_{WR}(\hat{t}_{HT}) = \frac{1}{n} \, \frac{1}{n-1} \sum_{i \in S} \left( \frac{n \hat{t}_i}{\pi_i} - \hat{t}_{HT} \right)^2 = \frac{n}{n-1} \sum_{i \in S} \left( \frac{\hat{t}_i}{\pi_i} - \frac{\hat{t}_{HT}}{n} \right)^2. \tag{6.30}$$

The with-replacement variance estimator for two-stage sampling has exactly the same
form as the estimator in (6.24) for one-stage sampling; the only difference is that we
substitute the estimator $\hat{t}_i$ for the ith psu population total $t_i$. We saw in Section 6.3
that the with-replacement variance estimator captures the variability at both stages
of sampling. This results in the tremendous practical advantage that the variance
estimation method depends only on information at the first-stage level of the design.
You do not have to use properties of the subsampling design at all for the variance
estimation.


6.4.4 Weights in Unequal-Probability Samples


All without-replacement sampling schemes discussed so far in the book can be
considered as special cases of two-stage cluster sampling with (possibly) unequal
probabilities. The formulas for unbiased estimation of totals in without-replacement sampling
in Chapters 2, 3, 5, and 6 are special cases of (6.25) through (6.29). In Example 6.15,
we will derive the formulas in Chapter 5 from the general Horvitz-Thompson results.
You will show that the formulas for stratified sampling are a special case of
Horvitz-Thompson estimation in Exercise 18.

As in earlier chapters, we can write the Horvitz-Thompson estimator using
sampling weights. The first-stage sampling weight for psu i is

$$w_i = \frac{1}{\pi_i}.$$

Thus, the Horvitz-Thompson estimator for the population total is

$$\hat{t}_{HT} = \sum_{i \in S} w_i \hat{t}_i.$$

For a without-replacement probability sample of ssus within psus, we define, using
the notation of Särndal et al. (1992),

$$\pi_{j|i} = P(j\text{th ssu in } i\text{th psu included in sample} \mid i\text{th psu is in the sample}).$$

Then,

$$\hat{t}_i = \sum_{j \in S_i} \frac{y_{ij}}{\pi_{j|i}}.$$

The overall probability that ssu j of psu i is included in the sample is $\pi_{j|i}\pi_i$. Thus, we
can define the sampling weight for the (i, j)th ssu as

$$w_{ij} = \frac{1}{\pi_{j|i}\pi_i} \tag{6.31}$$

and the Horvitz-Thompson estimator of the population total as

$$\hat{t}_{HT} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}. \tag{6.32}$$

The population mean is estimated by

$$\hat{\bar{y}}_{HT} = \frac{\hat{t}_{HT}}{\hat{M}_0} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}. \tag{6.33}$$

The estimator $\hat{\bar{y}}_{HT}$ is a ratio, so, using the results from Chapter 4, we estimate its
variance by forming the residuals from the estimated psu totals. Let

$$\hat{e}_i = \hat{t}_i - \hat{\bar{y}}_{HT} \hat{M}_i,$$

where $\hat{M}_i = \sum_{j \in S_i} (1/\pi_{j|i})$ estimates the number of ssus in psu i. Note that
$\hat{e}_i/\pi_i = \sum_{j \in S_i} w_{ij} (y_{ij} - \hat{\bar{y}}_{HT})$ and $\sum_{i \in S} \hat{e}_i/\pi_i = 0$. We then use the with-replacement variance
in (6.30), with $\hat{e}_i/\hat{M}_0$ substituted for $\hat{t}_i$, to obtain:

$$\hat{V}_{WR}(\hat{\bar{y}}_{HT}) = \frac{n}{n-1} \sum_{i \in S} \left( \frac{\hat{e}_i}{\pi_i \hat{M}_0} \right)^2 = \frac{n}{n-1} \sum_{i \in S} \left( \frac{\sum_{j \in S_i} w_{ij} (y_{ij} - \hat{\bar{y}}_{HT})}{\sum_{k \in S} \sum_{j \in S_k} w_{kj}} \right)^2, \tag{6.34}$$

where $\hat{M}_0 = \sum_{i \in S} w_i \hat{M}_i = \sum_{i \in S} \sum_{j \in S_i} w_{ij}$ estimates $M_0$, the number of ssus in the
population. Survey software will calculate these quantities for you.


Let’s take a two-stage unequal-probability sample without replacement from the pop- 
ulation of statistics classes in Example 6.2. We want the psu inclusion probabilities 
to be proportional to the class sizes M; given in Table 6.1. SAS code used to select 
and analyze the sample is given on the website; the data are in file classpps.dat and 
in Table 6.7. 

We calculate the weight for each student in the sample as

$$w_{ij} = \frac{1}{\pi_{j|i}\,\pi_i} = \frac{1}{\pi_i (4/M_i)}.$$

Since the same number of students ($m_i = 4$) is selected from each class and since
the psu inclusion probabilities $\pi_i$ are proportional to the class sizes $M_i$, the sample of
students is self-weighting.
The estimated total number of hours spent studying statistics is

$$\hat{t}_{HT} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij} = 2232.15.$$

This can also be calculated by $\hat{t}_{HT} = \sum_{i \in S} \hat{t}_i/\pi_i = 2232.15$. Using the
with-replacement variance estimate in (6.30),

$$\hat{V}_{WR}(\hat{t}_{HT}) = \frac{n}{n-1} \sum_{i \in S} \left( \frac{\hat{t}_i}{\pi_i} - \frac{\hat{t}_{HT}}{n} \right)^2 = \frac{5}{4}(77{,}749.9) = 97{,}187.4,$$
TABLE 6.7
Data from Two-Stage Sample of Introductory Statistics Classes

Column (a) is $(\hat{t}_i/\pi_i - \hat{t}_{HT}/n)^2$; column (b) is $(\hat{e}_i/(\pi_i \hat{M}_0))^2$.

Class  M_i  pi_i     w_ij    y_ij  w_ij*y_ij  t_i-hat  t_i-hat/pi_i  (a)        (b)
4      22   0.17002  32.35   5     161.750    110.00   646.983       40,222.54  0.09609
4      22   0.17002  32.35   4.5   145.575
4      22   0.17002  32.35   5.5   177.925
4      22   0.17002  32.35   5     161.750
10     34   0.26275  32.35   2     64.700     106.25   404.377       1,768.23   0.00423
10     34   0.26275  32.35   4     129.400
10     34   0.26275  32.35   3     97.050
10     34   0.26275  32.35   3.5   113.225
1      44   0.34003  32.35   5     161.750    154.00   452.901       41.91      0.00010
1      44   0.34003  32.35   3     97.050
1      44   0.34003  32.35   4     129.400
1      44   0.34003  32.35   2     64.700
9      54   0.41731  32.35   3.5   113.225    195.75   469.076       512.96     0.00123
9      54   0.41731  32.35   4     129.400
9      54   0.41731  32.35   1     32.350
9      54   0.41731  32.35   6     194.100
14     100  0.77280  32.35   2     64.700     200.00   258.799       35,204.25  0.08410
14     100  0.77280  32.35   1.5   48.525
14     100  0.77280  32.35   1.5   48.525
14     100  0.77280  32.35   3     97.050
Sum                  647.00        2232.15             2232.15       77,749.90  0.18574


giving a standard error of $\sqrt{97{,}187.4} = 311.7$. For this example, since $n = 5$ is
large relative to $N = 15$, this standard error is likely an overestimate; in
Exercise 14 you will calculate the without-replacement variance estimates in (6.28) and
(6.29), as well as an approximation to the without-replacement variance used by SAS
software.

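We estimate the mean number of hours spent studying statistics by

$$\hat{\bar{y}}_{HT} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}} = \frac{2232.15}{647} = 3.45.$$

Using (6.34),

$$\hat{V}_{WR}(\hat{\bar{y}}_{HT}) = \frac{n}{n-1} \sum_{i \in S} \left( \frac{\hat{e}_i}{\pi_i \hat{M}_0} \right)^2 = \frac{5}{4}(0.18574) = 0.23218,$$

so $\text{SE}(\hat{\bar{y}}_{HT}) = \sqrt{0.23218} = 0.482$. ■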
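The arithmetic of this example can be retraced with a few lines of Python. This is only a sketch, not the SAS code from the book's website: the function name `ht_summary` is made up for illustration, the data are typed in from Table 6.7, and the first response in class 4 is taken to be 5, the value implied by $w_{ij}y_{ij} = 161.750$. Because the weights are computed from the unrounded $\pi_i$ rather than the rounded 32.35, the results agree with the text only up to rounding.

```python
# Data from Table 6.7: class size M_i, inclusion probability pi_i, and the
# m_i = 4 sampled responses y_ij (hours spent studying statistics).
CLASSES = [
    (22, 0.17002, [5, 4.5, 5.5, 5]),
    (34, 0.26275, [2, 4, 3, 3.5]),
    (44, 0.34003, [5, 3, 4, 2]),
    (54, 0.41731, [3.5, 4, 1, 6]),
    (100, 0.77280, [2, 1.5, 1.5, 3]),
]


def ht_summary(psus):
    """Return (t_HT, V_WR(t_HT), ybar_HT, V_WR(ybar_HT)) for a two-stage sample."""
    n = len(psus)
    # w_ij = 1 / (pi_{j|i} * pi_i), with pi_{j|i} = m_i / M_i for an SRS of ssus.
    rows = [(M / (len(ys) * pi), ys) for M, pi, ys in psus]
    sum_w = sum(w * len(ys) for w, ys in rows)            # estimates M_0
    psu_terms = [sum(w * y for y in ys) for w, ys in rows]  # t_i-hat / pi_i
    t_ht = sum(psu_terms)
    v_total = n / (n - 1) * sum((z - t_ht / n) ** 2 for z in psu_terms)
    ybar = t_ht / sum_w
    # e_i-hat / (pi_i * M_0-hat) = sum_j w_ij (y_ij - ybar) / sum of all weights
    resid = [sum(w * (y - ybar) for y in ys) / sum_w for w, ys in rows]
    v_mean = n / (n - 1) * sum(r ** 2 for r in resid)
    return t_ht, v_total, ybar, v_mean


t_ht, v_total, ybar, v_mean = ht_summary(CLASSES)
print(round(t_ht, 1), round(v_total ** 0.5, 1), round(ybar, 2), round(v_mean ** 0.5, 3))
```

The same function applies to any two-stage design in which an SRS of ssus is taken within each sampled psu; other second-stage designs would need their own $\pi_{j|i}$.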
6.5 Examples of Unequal-Probability Samples


Many sampling situations are well suited for unequal-probability samples. This sec- 
tion gives three examples of sampling designs in common use. 


EXAMPLE 6.12 Random Digit Dialing. 

In telephone surveys, it is important to have a well-defined and efficient procedure 
to select telephone numbers for the sample. In the early days of telephone surveys, 
many organizations simply took numbers from the telephone directory. That approach 
leads to selection bias, however, because telephone numbers that are unlisted or added 
since publication do not appear in the directory. Modifications of sampling from the 
directory have been suggested to allow inclusion of unlisted numbers, but most have 
some difficulties with undercoverage. 


Random Digit Dialing Element Sampling. Generating telephone numbers at random 
from the frame of all possible telephone numbers avoids undercoverage of unlisted 
numbers. In the United States, telephone numbers consist of 


area code + prefix (or exchange) + suffix 
(3 digits) (3 digits) (4 digits) 


Thus a random sample of telephone numbers in the United States can be chosen 
by randomly selecting a 10-digit number. If the random number chosen does not 
belong to a household, the number is discarded and a new 10-digit number tried. The 
procedure is repeated until the desired sample size is obtained. 

This method is simple to understand and explain, and, assuming no nonresponse, 
produces an SRS of telephone numbers from the frame of all possible telephone num- 
bers. In practice, the method can be expensive: Even with the frame of telephone num- 
bers restricted to area codes and prefixes known to be in use, many telephone numbers 
generated by this method will not belong to a household. Multiple calls to a number 
may be needed to ascertain whether the number is residential or not. 


The Mitofsky-Waksberg Method. Mitofsky (1970) and Waksberg (1978) developed a
cluster-sampling method for sampling residential telephone numbers. The following 
description is of the “sampler’s utopia” procedure in which everyone answers the 
phone (see Brick and Tucker, 2007). 

First, form the sampling frame of psus. Construct a list of all area codes and 
prefixes in the area of interest. Form a list of psus by appending each of the numbers 
00 to 99 to each possible combination of area code and prefix. The resulting list of 
psus consists of the set of possible first eight digits for the 10-digit telephone numbers 
in the population. Each psu in the frame contains the numbers (abc)-def-gh00 to (abc)- 
def-gh99, and is called a 100-bank of numbers. 

The Mitofsky-Waksberg method then uses a method similar to Lahiri’s (1951)
method to sample psus with probabilities proportional to the number of residential 
telephone numbers. Select a psu at random from the list of all psus, and also select a 
number randomly between 00 and 99 to serve as the last two digits. Dial that telephone 
number. If the selected number is residential, interview the household and choose its
psu to be in the sample; the associated psu is the block of 100 telephone numbers that 
have the same first eight digits as the selected number. For example, if the randomly 
selected telephone number (202) 456-1414 is determined to be residential, then the 
psu of all numbers of the form (202) 456-14xx is included in the sample. Continue 
sampling in that psu until a total of k interviews are obtained. If the original number 
selected in the psu is not residential, reject that psu. Continue the procedure until the 
desired number of psus, n, is selected. 

Lepkowski (1988) found that in the late 1980s, 60% of telephone numbers chosen
with the Mitofsky-Waksberg method reached households, compared with 25% for
random digit element sampling. The method worked well because the psus of 100
telephone numbers were clustered: some psus were unassigned, some tended to
be assigned to commercial establishments, and some were largely residential. The
procedure eliminates sampling unassigned psus at the second stage, and reduces the
probability of selecting psus with few residential telephone numbers.

Under ideal conditions, the Mitofsky-Waksberg procedure samples psus with
probabilities proportional to the number of residential telephone numbers in the psus.
If the second stage prescribes selecting an additional $(k - 1)$ residential telephone
numbers in each sampled psu, and if all psus in the sample have at least $k$ residential
telephone numbers, then the Mitofsky-Waksberg procedure gives each residential
telephone number the same probability of being selected in the sample. The result is
a self-weighting sample of residential telephone numbers.

To see this, let $M_i$ be the number of residential telephone numbers in psu $i$, and
let $N$ be the total number of psus in the sampling frame. The probability that psu $i$ is
selected to be in the sample on the first iteration of the procedure is $M_i/M_0$, where
$M_0 = \sum_{i=1}^{N} M_i$ (see Exercise 32), even though the values of $M_i$ and $M_0$ are unknown.
Then, if each psu in the population has either $M_i = 0$ or $M_i \ge k$,

$$P(\text{number selected}) = P(\text{psu $i$ selected})\, P(\text{number selected} \mid \text{psu $i$ selected}) \propto \frac{M_i}{M_0} \cdot \frac{k}{M_i} = \frac{k}{M_0}.$$

The sampling weight for each number in the sample is $M_0/k$; to estimate a
population total, you would need to know $M_0$, the total number of residential telephone
numbers in the population. To estimate an average or proportion, the typical goal
of telephone surveys, you do not need to know $M_0$. You only need to know a
“relative weight” $w_{ij}$ for each response $y_{ij}$ in the sample, and can estimate the population
mean as

$$\hat{\bar{y}} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}.$$

Here, with a self-weighting sample, you can use relative weights of $w_{ij} = 1$.
Note that although under ideal conditions the Mitofsky-Waksberg method leads
to a self-weighting sample of residential telephone numbers, it does not give a
self-weighting sample of households: some households may have more than one
telephone number; others may not have a telephone. In practice, someone using the
Mitofsky-Waksberg method would adjust the weights to compensate for multiple
telephone lines and nonresponse, as will be discussed in Chapter 8.

Although in ideal situations the Mitofsky-Waksberg method produces a
self-weighting sample of residential telephone numbers, those ideal situations are rarely
encountered in practice. The inclusion of a psu in the sample depends on the
determination of whether the first number dialed is residential or not. But a household belonging
to that number may not respond, or may require many attempts to be reached, which
delays the decision about whether to include that psu in the sample or results in an
incorrect rejection of the psu. Many survey organizations currently use list-assisted
random digit dialing (Casady and Lepkowski, 1993), in which telephone numbers
are selected from 100-banks constructed from telephone directories. The telephone
numbers in a 100-bank are included in the sampling frame if the directory contains
at least one telephone number in that 100-bank. The 100-banks with no numbers in
the directory are not included in the sampling frame. With list-assisted random digit
dialing, there is undercoverage of households that are in a 100-bank where everyone
has an unlisted number, but the undercoverage is thought to be small. Tucker et al.
(2002) discuss these methods in view of changes in the assignment of residential
telephone numbers. The increased prevalence of cell-only households has increased the
coverage problems of random digit dialing surveys based on directories of landline
numbers; Lavrakas et al. (2007) discuss the challenges involved in sampling cellular
telephone households. ■


EXAMPLE 6.13 3-P Sampling.

Probability Proportional to Prediction (3-P) sampling, described by Schreuder 
et al. (1968), is commonly recommended as a sampling scheme in forestry. Suppose 
an investigator wants to estimate the total volume of timber in an area. Several options 
are available: (1) Estimate the volume for each tree in the area. There may be thousands 
of trees, however, and this can be very time consuming. (2) Use a cluster sample in 
which plots of equal areas are selected, and the volume of every tree in the selected 
plots measured. (3) Use an unequal-probability sampling scheme in which points in 
the area are selected at random, and the trees closest to the points are included in the 
sample. In this design, a tree is selected with probability proportional to the area of 
the region that is closer to that tree than to any other tree. (4) Estimate the volume of 
each tree by eye and then select trees with probability proportional to the estimated 
volume. When done in one pass, with trees selected as the volume is estimated, this
is 3-P sampling; the prediction P stands for the predicted (estimated) volume used
in determining the $\pi_i$’s.

The largest trees tend to produce the most timber and contribute most to the
variability of the estimate of total volume. Thus, unequal-probability sampling can
be expected to lead to less sampling effort. Theoretically, you could estimate the
volume of each of the $N$ trees in the forest by eye, obtaining a value $x_i$ for tree $i$. Then,
you could revisit trees randomly selected with probabilities proportional to $x_i$, and
carefully measure the volume $t_i$. Such a procedure, however, requires you to make two
trips through the forest and adds much work to the sampling process. In 3-P sampling,
only one trip is made through the forest, and trees are selected for the sample at the
same time the $x_i$’s are measured. The procedure is as follows:


1 Estimate or guess what the maximum value of $x_i$ for the trees is likely to be. Define
a value $L$ that is larger than your estimated maximum value of $x_i$.

2 Proceed to a tree in the forest, and determine $x_i$ for that tree. Generate a random
number $u_i$ in $[0, L]$. If $u_i < x_i$, then measure the volume $y_i$ on that tree; otherwise,
skip that tree and go on to the next tree.

3 Repeat step 2 on every tree in the forest.


The unequal-probability sampling in this case essentially gives every board-foot
of timber an equal chance of being selected for the sample. Note that the size of the
unequal-probability sample is unknown until sampling is completed. The probability
that tree $i$ is included in the sample is $\pi_i = x_i/L$. The Horvitz-Thompson estimator is

$$\hat{t}_{HT} = \sum_{i \in S} \frac{y_i}{\pi_i} = L \sum_{i \in S} \frac{y_i}{x_i} = \sum_{i=1}^{N} Z_i \frac{y_i}{\pi_i},$$

where $Z_i = 1$ if tree $i$ is in the sample, and 0 otherwise. The $Z_i$’s are independent
Bernoulli random variables ($Z_i$ has success probability $\pi_i$), so 3-P sampling is a special
case of a method known as Poisson sampling. The sample size is the random variable
$\sum_{i=1}^{N} Z_i$ with expected value $\sum_{i=1}^{N} x_i/L$.

Because the sample size is variable rather than fixed, Poisson sampling provides a
different method of unequal-probability sampling than those discussed in Sections 6.1
through 6.4. Särndal et al. (1992) give additional theory and references for Poisson
sampling. ■
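The one-pass procedure of this example can be sketched in Python. The function names, the tree values, and the choice to pass the random number generator as a parameter are all assumptions made for illustration. In the field, $y_i$ would be measured only for selected trees; the sketch carries $y_i$ for every tree simply so that it can be run as a simulation.

```python
import random


def three_p_sample(trees, L, rng=random):
    """One pass of 3-P (Poisson) sampling.

    trees: iterable of (x_i, y_i) pairs, with 0 < x_i <= L.
    A tree is kept when a uniform draw on [0, L] falls below its predicted
    volume x_i, so tree i is included with probability pi_i = x_i / L.
    """
    sample = []
    for x, y in trees:
        u = rng.uniform(0, L)
        if u < x:
            sample.append((x, y))
    return sample


def ht_total(sample, L):
    """Horvitz-Thompson estimate: sum of y_i / pi_i = L * sum of y_i / x_i."""
    return L * sum(y / x for x, y in sample)


def expected_sample_size(trees, L):
    """Expected number of trees selected: sum of x_i / L."""
    return sum(x for x, _ in trees) / L


forest = [(2.0, 2.2), (4.0, 4.4), (5.0, 5.1)]    # hypothetical (x_i, y_i) values
picked = three_p_sample(forest, L=6.0, rng=random.Random(1))
print(len(picked), ht_total(picked, L=6.0), expected_sample_size(forest, L=6.0))
```

Because each tree is accepted or skipped independently, the realized sample size changes from run to run, which is the feature that distinguishes Poisson sampling from the fixed-size designs of the earlier sections.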


Unequal-probability methods are common in natural resource sampling. Overton 
and Stehman (1995) give a number of other examples. 


EXAMPLE 6.14 Dollar Unit Sampling.

An accountant auditing the accounts receivable amounts for a company often takes 
a sample to estimate the true total accounts receivable balance. The book value $x_i$ is
known for each account in the population; the audited value $t_i$ will be known only for
accounts in the sample. In Section 4.3 we saw how the auxiliary information $x_i$ could
be used in difference estimation to improve the precision from an SRS of accounts. 
Ratio or regression estimation could be used similarly. 

Instead of being used in the analysis, the book values could be used in the design 
of the sample. You could stratify the accounts by the value of x;, or you could take 
an unequal-probability sample with inclusion probabilities proportional to x;. (Or 
you could do both: First stratify, then sample with unequal probabilities within each 
stratum.) If you sample accounts with probabilities proportional to x;, then each 
individual dollar in the book values has the same probability of being selected in 
the sample (hence the name dollar unit sampling). With each dollar equally likely to 
be included in the sample, an account with book value $10,000 is ten times as likely 
to be in the sample as an account with book value $1000. 

Consider a client with 87 accounts receivable, with a book balance of $612,824.
The auditor has decided that a sample of size 25 will be sufficient for estimating
the error in accounts receivable and takes a random sample with replacement of the
612,824 dollars in the book value population. As individual dollars can only be audited
as part of the whole account, each dollar selected serves as a “hook” to snag the whole
account for audit. The cumulative-size method is used to select psus (accounts) for
this example; often, in practice, auditors take a systematic sample of dollars and their
accompanying psus. A systematic sample guarantees that accounts with book values
greater than the sampling interval will be included in the sample. Table 6.8 shows the
first few lines of the account selection; the full table is in file auditselect.dat. Here,
accounts 3 and 13 are included once, and account 9 is included twice (but only needs
to be audited once since this is a one-stage cluster sample). This is thus an example
of one-stage pps sampling with replacement, as discussed in Section 6.2.

TABLE 6.8
Account Selection for Audit Sample

Account       Book     Cumulative   Random
(Audit Unit)  Value    Book Value   Number
1             2,459    2,459
2             2,343    4,802
3             6,842    11,644       11,016
4             4,179    15,823
5             750      16,573
6             2,708    19,281
7             3,073    22,354
8             4,742    27,096
9             16,350   43,446       31,056  38,500
10            5,424    48,870
11            9,539    58,409
12            3,108    61,517
13            3,935    65,452       63,047
14            900      66,352
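The cumulative-size lookup behind Table 6.8 can be sketched in a few lines of Python. The function name `select_accounts` is hypothetical; the book values and random numbers are the ones shown in the table. Each random dollar is mapped to the first account whose cumulative book value reaches it.

```python
from bisect import bisect_left
from itertools import accumulate

# Book values of the first 14 accounts, as listed in Table 6.8.
BOOK_VALUES = [2459, 2343, 6842, 4179, 750, 2708, 3073,
               4742, 16350, 5424, 9539, 3108, 3935, 900]


def select_accounts(book_values, random_dollars):
    """Map each randomly chosen dollar (1-based) to its 1-based account number."""
    cumulative = list(accumulate(book_values))
    # Dollar r belongs to the first account whose cumulative value is >= r.
    return [bisect_left(cumulative, r) + 1 for r in random_dollars]


print(select_accounts(BOOK_VALUES, [11016, 31056, 38500, 63047]))
# -> [3, 9, 9, 13]
```

Account 9 appears twice because two of the random dollars fall inside its range of cumulative book values, which is how a with-replacement pps sample can select the same psu more than once.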

The selected accounts are audited, and the audit values are recorded in file
auditresult.dat. The overstatement in each sampled account is calculated as (book value -
audit value). Table 6.9 gives part of a spreadsheet (the full spreadsheet is on the
website) that may be used to estimate the total overstatement. Using the results from
Section 6.2, the total overstatement is estimated from (6.5) to be $4334 with
standard error $13,547/√25 = $2709 from (6.6). In many auditing situations, however,
most of the audited values agree with the book values, so most of the differences are
zeros. A CI based on a normal approximation does not perform well in this situation,
so auditors typically use a CI based on the Poisson or multinomial distribution (see
Neter et al., 1978) rather than a CI of the form (estimate ± 1.96 SE).

Another way of looking at the unequal-probability estimate is to find the
overstatement for each individual dollar in the sample. Account 24, for example, has a
book value of $7090 and an error of $40. The error is prorated to every dollar in the
book value, leading to an overstatement of $0.00564 for each of the 7090 dollars. The
average overstatement for the individual dollars in the sample is $0.007071874, so
the total overstatement for the population is estimated as (0.007071874)(612824) =
4334. ■

TABLE 6.9
Results of the Audit on Accounts in the Sample

Account       Book                     Audit        BV - AV                     Difference
(Audit Unit)  Value (BV)  $\psi_i$     Value (AV)   Difference   Diff/$\psi_i$  per Dollar
3             6,842       0.0111647    6,842        0            0              0.00000
9             16,350      0.0266798    16,350       0            0              0.00000
9             16,350      0.0266798    16,350       0            0              0.00000
13            3,935       0.0064211    3,935        0            0              0.00000
24            7,090       0.0115694    7,050        40           3,457          0.00564
29            5,533       0.0090287    5,533        0            0              0.00000
75            2,291       0.0037384    2,191        100          26,749         0.04365
79            4,667       0.0076156    4,667        0            0              0.00000
81            31,257      0.0510049    31,257       0            0              0.00000
average                                                          4,334          0.007071874
std. dev.                                                        13,547         0.02210527
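The two ways of looking at the estimate, per account and per dollar, give the same number because an account's draw probability is its book value divided by the total book value. A short sketch, with hypothetical function names and only the figures quoted in this example, makes the link explicit. It does not reproduce the full 25-account calculation, since only part of the spreadsheet is shown here.

```python
from math import sqrt

TOTAL_BOOK_VALUE = 612_824          # total book value of the 87 accounts


def draw_probability(book_value):
    """Probability that a randomly drawn dollar falls in this account."""
    return book_value / TOTAL_BOOK_VALUE


def scaled_difference(book_value, audit_value):
    """Overstatement divided by the draw probability (the Diff/psi column)."""
    return (book_value - audit_value) / draw_probability(book_value)


def per_dollar_difference(book_value, audit_value):
    """Overstatement prorated to each dollar of book value."""
    return (book_value - audit_value) / book_value


# Account 24 from Table 6.9: book value 7090, audit value 7050.
print(round(scaled_difference(7090, 7050)), round(per_dollar_difference(7090, 7050), 5))
# Averages reported in the text, for the full sample of 25 dollars:
print(round(0.007071874 * TOTAL_BOOK_VALUE), round(13547 / sqrt(25)))
```

For every account, the scaled difference equals the per-dollar difference multiplied by the total book value, so averaging either column leads to the same estimated total overstatement.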


6.6 


Randomization Theory Results and Proofs* 


In two-stage cluster sampling, we select the psus first and then select subunits within 
the sampled psus. One approach to calculate a theoretical variance for any estimator 
in multistage sampling is to condition on which psus are included in the sample. 
To do this, we need to use Properties 4 (successive conditioning) and 5 (calculating 
variances conditionally) of conditional expectation, stated in Section A.4. 

In this section, we state and prove Theorem 6.2, the Horvitz-Thompson Theorem
(Horvitz and Thompson, 1952), which gives the properties of the estimator in (6.25).
In Theorem 6.4, we find unbiased estimators of the variance. We then show that the 
variance for cluster sampling with equal probabilities in (5.21) follows as a special 
case of these theorems. First, however, we prove (6.17) and (6.18). 

Throughout this section, let

$$Z_i = \begin{cases} 1 & \text{if psu $i$ is in the sample} \\ 0 & \text{if psu $i$ is not in the sample} \end{cases} \tag{6.35}$$

denote the random variable specifying whether psu $i$ is included in the sample or not.
The probability that psu $i$ is included in the sample is

$$\pi_i = P(Z_i = 1) = E(Z_i); \tag{6.36}$$

the probability that both psu $i$ and psu $k$ ($i \ne k$) are included in the sample is

$$\pi_{ik} = P(Z_i = 1 \text{ and } Z_k = 1) = E(Z_i Z_k). \tag{6.37}$$


THEOREM 6.1 


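For a without-replacement probability sample of $n$ units, let $Z_i$, $\pi_i$, and $\pi_{ik}$ be as
defined in (6.35)-(6.37). Then

$$\sum_{i=1}^{N} \pi_i = n$$

and

$$\sum_{\substack{k=1 \\ k \ne i}}^{N} \pi_{ik} = (n - 1)\pi_i.$$

Proof Since the sample size is $n$, $\sum_{i=1}^{N} Z_i = n$ for every possible sample. Also,

$$E[Z_i] = E[Z_i^2] = \pi_i$$

because $P(Z_i = 1) = \pi_i$. Consequently,

$$n = E\left[\sum_{i=1}^{N} Z_i\right] = \sum_{i=1}^{N} \pi_i.$$

In addition,

$$\sum_{\substack{k=1 \\ k \ne i}}^{N} \pi_{ik} = \sum_{\substack{k=1 \\ k \ne i}}^{N} E[Z_i Z_k] = E[Z_i(n - Z_i)] = \pi_i(n - 1),$$

which completes the proof. ■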
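Both identities of Theorem 6.1 can be checked by brute force on a small design. The sketch below uses a made-up fixed-size design (samples of $n = 2$ psus from $N = 4$, with unequal sample probabilities) and exact fractions; the design, the variable names, and the helper `inclusion_probs` are all hypothetical.

```python
from fractions import Fraction

# Hypothetical fixed-size design: each possible sample of n = 2 psus out of
# N = 4 is listed with its selection probability (the probabilities sum to 1).
DESIGN = {
    (0, 1): Fraction(1, 10), (0, 2): Fraction(1, 10), (0, 3): Fraction(2, 10),
    (1, 2): Fraction(1, 10), (1, 3): Fraction(2, 10), (2, 3): Fraction(3, 10),
}
N_PSUS, N_SAMPLE = 4, 2


def inclusion_probs(design, n_psus):
    """Return first-order (pi_i) and joint (pi_ik) inclusion probabilities."""
    pi = [Fraction(0)] * n_psus
    pik = [[Fraction(0)] * n_psus for _ in range(n_psus)]
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob
            for k in sample:
                if k != i:
                    pik[i][k] += prob
    return pi, pik


pi, pik = inclusion_probs(DESIGN, N_PSUS)
print(sum(pi))                                            # equals n
for i in range(N_PSUS):
    row_sum = sum(pik[i][k] for k in range(N_PSUS) if k != i)
    print(i, row_sum, (N_SAMPLE - 1) * pi[i])             # the two columns agree
```

The check relies on every sample having exactly $n$ psus; for a design with random sample size, such as Poisson sampling, the identities need not hold.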


THEOREM 6.2 
Horvitz-Thompson 


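Let $Z_i$, $\pi_i$, and $\pi_{ik}$ be as defined in (6.35)-(6.37). Suppose that sampling is done at the
second stage so that sampling in any psu is independent of the sampling in any other
psu, and that $\hat{t}_i$ is independent of $(Z_1, \ldots, Z_N)$ with $E[\hat{t}_i] = E[\hat{t}_i \mid Z_1, \ldots, Z_N] = t_i$.
Then

$$E\left[\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i}\right] = \sum_{i=1}^{N} t_i = t \tag{6.38}$$

and

$$V\left[\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i}\right] = V_{\text{psu}} + V_{\text{ssu}}, \tag{6.39}$$

where

$$V_{\text{psu}} = V\left[\sum_{i=1}^{N} Z_i \frac{t_i}{\pi_i}\right] = \sum_{i=1}^{N} (1 - \pi_i)\frac{t_i^2}{\pi_i} + \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_{ik} - \pi_i \pi_k)\frac{t_i}{\pi_i}\frac{t_k}{\pi_k} \tag{6.40}$$

and

$$V_{\text{ssu}} = \sum_{i=1}^{N} \frac{V(\hat{t}_i)}{\pi_i}. \tag{6.41}$$

Proof First note that

$$\text{Cov}(Z_i, Z_k) = \begin{cases} \pi_i(1 - \pi_i) & \text{if } i = k \\ \pi_{ik} - \pi_i \pi_k & \text{if } i \ne k. \end{cases}$$

We use successive conditioning to show (6.38):

$$E\left[\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i}\right] = E\left[E\left[\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i} \,\Big|\, Z_1, \ldots, Z_N\right]\right] = E\left[\sum_{i=1}^{N} Z_i \frac{t_i}{\pi_i}\right] = \sum_{i=1}^{N} \pi_i \frac{t_i}{\pi_i} = t.$$

The first step above simply applies successive conditioning; in the second step, we
use the independence of $\hat{t}_i$ and $(Z_1, \ldots, Z_N)$.

To find the variance, we use the expression for calculating the variance
conditionally in Property 5 of Section A.4, and again use the independence of $\hat{t}_i$ and
$(Z_1, \ldots, Z_N)$, together with the independence of the subsampling in different psus:

$$V\left[\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i}\right] = V\left[E\left(\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i} \,\Big|\, Z_1, \ldots, Z_N\right)\right] + E\left[V\left(\sum_{i=1}^{N} Z_i \frac{\hat{t}_i}{\pi_i} \,\Big|\, Z_1, \ldots, Z_N\right)\right]$$

$$= V\left[\sum_{i=1}^{N} Z_i \frac{t_i}{\pi_i}\right] + E\left[\sum_{i=1}^{N} Z_i \frac{V(\hat{t}_i)}{\pi_i^2}\right] = \sum_{i=1}^{N} \sum_{k=1}^{N} \frac{t_i}{\pi_i}\frac{t_k}{\pi_k}\,\text{Cov}(Z_i, Z_k) + \sum_{i=1}^{N} \frac{V(\hat{t}_i)}{\pi_i}.$$

Substituting the covariances gives $V_{\text{psu}}$ in (6.40) for the first term, and the second term is $V_{\text{ssu}}$ in (6.41). ■

Equation (6.38) establishes that the Horvitz-Thompson estimator is unbiased, and
(6.39) through (6.41) show that (6.26) is the variance of the Horvitz-Thompson
estimator. In one-stage cluster sampling, $V(\hat{t}_i) = 0$ for $i \in S$, so $V_{\text{ssu}} = 0$ and $V(\hat{t}_{HT}) = V_{\text{psu}}$
as given in (6.20).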

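We now show that the Horvitz-Thompson form of the variance in (6.20) and the
SYG form in (6.21) are equivalent.

THEOREM 6.3
Let $V_{\text{psu}}$ be as defined in (6.40). Then

$$V_{\text{psu}} = \sum_{i=1}^{N} (1 - \pi_i)\frac{t_i^2}{\pi_i} + \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_{ik} - \pi_i \pi_k)\frac{t_i}{\pi_i}\frac{t_k}{\pi_k} = \frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_i \pi_k - \pi_{ik})\left(\frac{t_i}{\pi_i} - \frac{t_k}{\pi_k}\right)^2.$$

Proof Expanding the square and using the symmetry of the double sum in $i$ and $k$,

$$\frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_i \pi_k - \pi_{ik})\left(\frac{t_i}{\pi_i} - \frac{t_k}{\pi_k}\right)^2 = \sum_{i=1}^{N} \frac{t_i^2}{\pi_i^2} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_i \pi_k - \pi_{ik}) + \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_{ik} - \pi_i \pi_k)\frac{t_i}{\pi_i}\frac{t_k}{\pi_k}.$$

By Theorem 6.1, $\sum_{k \ne i} (\pi_i \pi_k - \pi_{ik}) = \pi_i(n - \pi_i) - (n - 1)\pi_i = \pi_i(1 - \pi_i)$, so the first
term on the right equals $\sum_{i=1}^{N} (1 - \pi_i) t_i^2/\pi_i$,
which shows the equality of the two expressions for the variance. ■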
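The equivalence of the two variance forms can also be confirmed numerically. The sketch below is an illustration only: it reuses a made-up fixed-size design (samples of 2 psus from 4) and hypothetical psu totals, computes the one-stage Horvitz-Thompson variance directly from the sampling distribution, and compares it with the Horvitz-Thompson and SYG expressions using exact fractions.

```python
from fractions import Fraction

# Hypothetical design (n = 2 of N = 4) and hypothetical psu totals t_i.
DESIGN = {
    (0, 1): Fraction(1, 10), (0, 2): Fraction(1, 10), (0, 3): Fraction(2, 10),
    (1, 2): Fraction(1, 10), (1, 3): Fraction(2, 10), (2, 3): Fraction(3, 10),
}
TOTALS = [3, 7, 4, 10]


def inclusion_probs(design, n_psus):
    pi = [Fraction(0)] * n_psus
    pik = [[Fraction(0)] * n_psus for _ in range(n_psus)]
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob
            for k in sample:
                if k != i:
                    pik[i][k] += prob
    return pi, pik


def variance_forms(design, t):
    """Return (mean of t_HT, direct variance, HT form, SYG form), one-stage case."""
    n_psus = len(t)
    pi, pik = inclusion_probs(design, n_psus)
    pairs = [(i, k) for i in range(n_psus) for k in range(n_psus) if i != k]
    estimates = {s: sum(t[i] / pi[i] for i in s) for s in design}
    mean = sum(p * estimates[s] for s, p in design.items())
    direct = sum(p * (estimates[s] - mean) ** 2 for s, p in design.items())
    ht = sum((1 - pi[i]) * t[i] ** 2 / pi[i] for i in range(n_psus)) + sum(
        (pik[i][k] - pi[i] * pi[k]) * t[i] * t[k] / (pi[i] * pi[k]) for i, k in pairs
    )
    syg = Fraction(1, 2) * sum(
        (pi[i] * pi[k] - pik[i][k]) * (t[i] / pi[i] - t[k] / pi[k]) ** 2 for i, k in pairs
    )
    return mean, direct, ht, syg


mean, direct, ht, syg = variance_forms(DESIGN, TOTALS)
print(mean, direct, ht, syg)
```

All three variance values coincide, and the mean of the estimator equals the population total, as Theorems 6.2 and 6.3 state for the one-stage case.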


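Theorem 6.4 shows that (6.28) and (6.29) are unbiased estimators for the variance
in (6.26) and (6.27); the one-stage variance estimators in (6.20) and (6.21) follow as
a special case when $V(\hat{t}_i) = 0$.


THEOREM 6.4


Suppose the conditions of Theorem 6.2 hold, and that $\hat{V}(\hat{t}_i)$ is an unbiased estimator
of $V(\hat{t}_i)$ that is independent of $Z_i$. Then,

$$E\left[\sum_{i=1}^{N} Z_i \frac{\hat{V}(\hat{t}_i)}{\pi_i}\right] = \sum_{i=1}^{N} V(\hat{t}_i), \tag{6.42}$$

$$E\left[\sum_{i=1}^{N} Z_i (1 - \pi_i)\frac{\hat{t}_i^2}{\pi_i^2} + \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} Z_i Z_k \frac{\pi_{ik} - \pi_i \pi_k}{\pi_{ik}} \frac{\hat{t}_i}{\pi_i}\frac{\hat{t}_k}{\pi_k}\right]$$

$$= E\left[\frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} Z_i Z_k \frac{\pi_i \pi_k - \pi_{ik}}{\pi_{ik}} \left(\frac{\hat{t}_i}{\pi_i} - \frac{\hat{t}_k}{\pi_k}\right)^2\right] = V_{\text{psu}} + \sum_{i=1}^{N} (1 - \pi_i)\frac{V(\hat{t}_i)}{\pi_i}, \tag{6.43}$$

and

$$E\left[\hat{V}_{HT}(\hat{t}_{HT})\right] = E\left[\hat{V}_{SYG}(\hat{t}_{HT})\right] = V_{\text{psu}} + V_{\text{ssu}}. \tag{6.44}$$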


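Proof We prove (6.42) by using successive conditioning:

$$E\left[Z_i \frac{\hat{V}(\hat{t}_i)}{\pi_i}\right] = E\left[E\left(Z_i \frac{\hat{V}(\hat{t}_i)}{\pi_i} \,\Big|\, Z_i\right)\right] = E\left[Z_i \frac{V(\hat{t}_i)}{\pi_i}\right] = V(\hat{t}_i).$$

Result (6.42) follows by summation.

To prove (6.43), note that because $\hat{t}_i$ and $(Z_1, \ldots, Z_N)$ are independent,

$$E[\hat{t}_i^2 \mid Z_1, \ldots, Z_N] = E[\hat{t}_i^2] = t_i^2 + V(\hat{t}_i).$$

Thus,

$$E\left[\sum_{i=1}^{N} Z_i (1 - \pi_i)\frac{\hat{t}_i^2}{\pi_i^2}\right] = \sum_{i=1}^{N} (1 - \pi_i)\frac{t_i^2 + V(\hat{t}_i)}{\pi_i}.$$

Because subsampling is done independently in different psus, $E[\hat{t}_i \hat{t}_k] = t_i t_k$ for
$k \ne i$, so

$$E\left[\sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} Z_i Z_k \frac{\pi_{ik} - \pi_i \pi_k}{\pi_{ik}} \frac{\hat{t}_i}{\pi_i}\frac{\hat{t}_k}{\pi_k}\right] = \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_{ik} - \pi_i \pi_k)\frac{t_i}{\pi_i}\frac{t_k}{\pi_k}.$$

Combining the two results, we see that the expected value of the first expression in (6.43) is

$$\sum_{i=1}^{N} (1 - \pi_i)\frac{t_i^2}{\pi_i} + \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_{ik} - \pi_i \pi_k)\frac{t_i}{\pi_i}\frac{t_k}{\pi_k} + \sum_{i=1}^{N} (1 - \pi_i)\frac{V(\hat{t}_i)}{\pi_i} = V_{\text{psu}} + \sum_{i=1}^{N} (1 - \pi_i)\frac{V(\hat{t}_i)}{\pi_i},$$

which proves the first part of (6.43). We show the second part of (6.43) similarly,
using results from Theorem 6.1. For $k \ne i$, independence of the subsampling gives
$E[(\hat{t}_i/\pi_i - \hat{t}_k/\pi_k)^2] = (t_i/\pi_i - t_k/\pi_k)^2 + V(\hat{t}_i)/\pi_i^2 + V(\hat{t}_k)/\pi_k^2$, so

$$E\left[\frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \ne i}}^{N} Z_i Z_k \frac{\pi_i \pi_k - \pi_{ik}}{\pi_{ik}} \left(\frac{\hat{t}_i}{\pi_i} - \frac{\hat{t}_k}{\pi_k}\right)^2\right] = V_{\text{psu}} + \sum_{i=1}^{N} \frac{V(\hat{t}_i)}{\pi_i^2} \sum_{\substack{k=1 \\ k \ne i}}^{N} (\pi_i \pi_k - \pi_{ik}) = V_{\text{psu}} + \sum_{i=1}^{N} (1 - \pi_i)\frac{V(\hat{t}_i)}{\pi_i}.$$

Equation (6.44) follows because each of the variance estimators in (6.28) and (6.29) adds the term
$\sum_{i \in S} \hat{V}(\hat{t}_i)/\pi_i$ to the corresponding expression in (6.43), and by (6.42) that term has
expected value $\sum_{i=1}^{N} V(\hat{t}_i)$; the sum is $V_{\text{psu}} + \sum_{i=1}^{N} V(\hat{t}_i)/\pi_i = V_{\text{psu}} + V_{\text{ssu}}$. ■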


EXAMPLE 6.15 We now show that the results in Section 5.3 are special cases of Theorems 6.2 and 
6.4. If psus are selected with equal probabilities, 


$$P(Z_i = 1) = \pi_i = \frac{n}{N},$$

$$P(Z_i = 1 \text{ and } Z_k = 1) = \pi_{ik} = \frac{n}{N}\,\frac{n - 1}{N - 1},$$
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and 
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so we can apply Theorem 6.2 with 2; = n/N. Then, 
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and, from (6.40), 
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By result (2.9) from SRS theory, 
$$V(\hat{t}_i) = M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_i^2}{m_i},$$


so, using (6.41), 
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This completes the proof of (5.21). In the special case of an SRS, t; = y; and S? is the 
variance among population elements, so Vps, reduces to the formula in (2.16). 
For two-stage cluster sampling with equal probabilities, we defined 


to be the sample variance among the estimated psu totals in (5.22). We now show 
that, when z; = n/N and mz, = n(n — 1)/[N(N — 1], 
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so that Theorem 6.4 can be applied. Substituting n/N for 7; and n(n — 1)/[N(N — 1)] 
for Tix, 
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-(i-9)4 
Thus, by (6.43), 
efr(t-2)2]=e 0-H S202 v as 
i=l 


Note that the expected value of $s_t^2$ is larger than $S_t^2$: $s_t^2$ includes the variation from psu
total to psu total, plus variation from not knowing the psu totals.
Because 
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My 
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is an unbiased estimator of V(7;), Theorem 6.4 implies that 


2 (%) ra] =e[ (2) 20]= vm 


i=1 icS 


Using (6.45), then, 
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so (5.24) is an unbiased estimator of (5.21). = 


The methods used in these proofs can be applied to any number of levels of 
clustering. You may want to sample schools, then classes within schools, then 
students within classes. Exercise 36 asks you to find an expression for the vari- 
ance in three-stage cluster sampling. Rao (1979a) presents an alternative and ele- 
gant approach, relying on properties of nonnegative definite matrices, for deriving 
mean squared errors and variance estimators for linear estimators of population 
totals. 


6.7 Models and Unequal-Probability Sampling*


In general, data from a good sampling design should produce reasonable inferences
from either a model-based or randomization approach. Let’s see how the
Horvitz-Thompson estimator performs for Model M1 from (5.34). The model is

$$\text{M1:} \quad Y_{ij} = \mu + A_i + \varepsilon_{ij}$$

with the $A_i$’s generated by a distribution with mean 0 and variance $\sigma_A^2$, the $\varepsilon_{ij}$’s
generated by a distribution with mean 0 and variance $\sigma^2$, and all $A_i$’s and $\varepsilon_{ij}$’s independent.

As we did for the estimators in Chapter 5, we can write the estimator as a linear
combination of the random variables $Y_{ij}$. For a pps design, $\psi_i = M_i/M_0$ and $\pi_i = n\psi_i$, so

$$\hat{T}_P = \sum_{i \in S} \frac{M_0}{n M_i} \hat{T}_i = \sum_{i \in S} \sum_{j \in S_i} \frac{M_0}{n m_i} Y_{ij}.$$

Note that $\sum_{i \in S} \sum_{j \in S_i} M_0/(n m_i) = M_0$, so $\hat{T}_P$ is unbiased under Model M1 in (5.34).
In addition, from (5.36),

$$V_{M1}[\hat{T}_P - T] = \sigma_A^2\left[\frac{M_0^2}{n} + \sum_{i=1}^{N} M_i^2 - 2\frac{M_0}{n}\sum_{i \in S} M_i\right] + \sigma^2\left[\frac{M_0^2}{n^2}\sum_{i \in S}\frac{1}{m_i} - M_0\right].$$

The model-based variance for $\hat{T}_P$ has implications for design. Suppose a sample is
desired that will minimize $V_{M1}[\hat{T}_P - T]$. The psu sizes $M_i$ for the sample units appear
only in the term $-2\sigma_A^2 (M_0/n) \sum_{i \in S} M_i$, so for fixed $n$ the variance is smallest when
the $n$ units with largest $M_i$’s are included in the sample. If, in addition, a constraint
is placed on the number of subunits that can be examined, $\sum_{i \in S} (1/m_i)$ is smallest
when all $m_i$’s are equal.

Inference in the model-based approach does not depend on the sampling design. 
As long as model M1 holds for the population, $\hat{T}_P$ is model-unbiased with variance
given above. In a model-based approach, an investigator with complete faith in the 
model can simply select the psus with the largest values of M; to be the sample. In 
practice, however, this would not be done—no one has complete faith in a model, 
especially before data collection. Royall and Eberhardt (1975) suggested using bal- 
anced sampling, in which the sample is selected in such a way that inferences are 
robust to certain forms of model misspecification. 

As described in Section 6.2, pps sampling can be thought of as a way of
introducing randomness into the optimal design for model M1 and estimator $\hat{T}_P$. The
self-weighting design of taking all m;’s to be equal also minimizes the variance in 
the model-based approach. Thus, if model M1 is thought to describe the data, pps 
sampling and estimation should perform well in practice. 

We conclude our discussion with a widely quoted example from Basu (1971, 
pp. 212-213), often used to argue that Horvitz-Thompson estimates can be as silly 
as any other statistical procedures improperly applied. 


The circus owner is planning to ship his 50 adult elephants and so he needs a rough 
estimate of the total weight of the elephants. As weighing an elephant is a cumbersome 
process, the owner wants to estimate the total weight by weighing just one elephant. 
Which elephant should he weigh? So the owner looks back on his records and discovers 
a list of the elephants’ weights taken 3 years ago. He finds that 3 years ago Sambo the 
middle-sized elephant was the average (in weight) elephant in his herd. He checks with 
the elephant trainer who reassures him (the owner) that Sambo may still be considered 
to be the average elephant in the herd. Therefore, the owner plans to weigh Sambo 
and take 50y (where y is the present weight of Sambo) as an estimate of the total
weight Y = Y_1 + Y_2 + ... + Y_50 of the 50 elephants. But the circus statistician is
horrified when he learns of the owner’s purposive sampling plan. “How can you get
an unbiased estimate of Y this way?” protests the statistician. So, together they work
out a compromise sampling plan. With the help of a table of random numbers they
devise a plan that allots a selection probability of 99/100 to Sambo and equal selection 
probabilities of 1/4900 to each of the other 49 elephants. Naturally, Sambo is selected 
and the owner is happy. “How are you going to estimate Y?’, asks the statistician. 
“Why? The estimate ought to be 50y of course,” says the owner. “Oh! No! That cannot 
possibly be right,” says the statistician, “I recently read an article in the Annals of 
Mathematical Statistics where it is proved that the Horvitz-Thompson estimator is the 
unique hyperadmissible estimator in the class of all generalized polynomial unbiased 
estimators.” “What is the Horvitz-Thompson estimate in this case?” asks the owner, 
duly impressed. “Since the selection probability for Sambo in our plan was 99/100,” 
says the statistician, “the proper estimate of Y is 100y/99 and not 50y.” “And how
would you have estimated Y,” inquires the incredulous owner, “if our sampling plan 
made us select, say, the big elephant Jumbo?” “According to what I understand of 
the Horvitz-Thompson estimation method,” says the unhappy statistician, “the proper
estimate of Y would then have been 4900y, where y is Jumbo’s weight.” That is how 
the statistician lost his circus job (and perhaps became a teacher of statistics!) 


Should the circus statistician have been fired? A statistician desiring to use a
model in analyzing survey data would say yes: The circus statistician is using the
model $y_i \propto 99/100$ for Sambo, and $y_i \propto 1/4900$ for all other elephants in the herd,
certainly not a model that fits the data well. A randomization-inference statistician
would also say yes: Even though models are not used explicitly in the
Horvitz-Thompson theory, the estimator is most efficient (has the smallest variance) when
the psu total (here, $y_i$) is proportional to the probability of selection. The silly design
used by the circus statistician leads to a huge variance for the Horvitz-Thompson
estimator. If that were not reason enough, the statistician proposes a sample of size 1,
so he can neither check the validity of the model in a model-based approach nor
estimate the variance of the Horvitz-Thompson estimator!

Had the circus statistician used a ratio estimator in the design-based setting, he 
might have saved his job even though he did not use a good design. He wants to 
estimate the population total, $t_y$ (called Y in Basu’s paper). The ratio estimator is 

$$\hat{t}_{yr} = \frac{\hat{t}_y}{\hat{t}_x}\, t_x.$$

Thus, if Sambo is selected, 

$$\hat{t}_{yr} = \frac{y_{\text{Sambo}}/\pi_{\text{Sambo}}}{x_{\text{Sambo}}/\pi_{\text{Sambo}}}\, t_x = \frac{y_{\text{Sambo}}}{x_{\text{Sambo}}}\, t_x.$$

Similarly, if Jumbo is selected, 

$$\hat{t}_{yr} = \frac{y_{\text{Jumbo}}}{x_{\text{Jumbo}}}\, t_x.$$


With the ratio estimator, the total weight of the elephants from three years ago is 
multiplied by the ratio of (weight now)/(weight 3 years ago) for the selected elephant. 
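
The contrast between the two estimators can be checked numerically. The sketch below is only an illustration: the herd weights `x` (three years ago) and `y` (now) are made-up values, not Basu's, and elephant 0 plays the role of Sambo. It enumerates every possible sample of size 1 under the circus design and compares the Horvitz-Thompson and ratio estimates of the total.

```python
# Hypothetical herd of 50 elephants; index 0 plays the role of Sambo.
# x = weight three years ago (kg), y = weight now (kg). Values are made up.
N = 50
x = [3000.0 + 20.0 * i for i in range(N)]
y = [xi * (1.04 + 0.005 * (i % 5)) for i, xi in enumerate(x)]
t_x, t_y = sum(x), sum(y)

# Selection probabilities of the circus design (a sample of size 1).
psi = [99 / 100] + [1 / 4900] * (N - 1)

# Estimate of t_y for each elephant that could be the one selected.
ht = [y[i] / psi[i] for i in range(N)]          # Horvitz-Thompson
ratio = [y[i] / x[i] * t_x for i in range(N)]   # ratio estimator


def mean_and_mse(estimates):
    """Design expectation and mean squared error about t_y."""
    mean = sum(p * e for p, e in zip(psi, estimates))
    mse = sum(p * (e - t_y) ** 2 for p, e in zip(psi, estimates))
    return mean, mse


ht_mean, ht_mse = mean_and_mse(ht)
ratio_mean, ratio_mse = mean_and_mse(ratio)
print(f"t_y = {t_y:.0f}")
print(f"HT:    E = {ht_mean:.0f}, root MSE = {ht_mse ** 0.5:.0f}")
print(f"ratio: E = {ratio_mean:.0f}, root MSE = {ratio_mse ** 0.5:.0f}")
```

With these made-up weights the Horvitz-Thompson estimator is unbiased but has an enormous root mean squared error, while every possible ratio estimate stays close to the true total.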

Chapter Summary 


Unequal-probability samples occur naturally in many situations, particularly in cluster 
sampling when the psus have unequal sizes. If the psu population totals $t_i$ are highly 
correlated with $\psi_i$, then an unequal-probability sampling design can greatly increase 
efficiency. All estimators studied so far in the book can be viewed as special cases of 
the estimators used in unequal-probability sampling. 

We can draw an unequal-probability sample either with or without replacement. 
Selecting a with-replacement sample with unequal probabilities is simple; on each of 
the $n$ draws, select one of the $N$ psus with specified probability $\psi_i$, where $\sum_{i=1}^N \psi_i = 1$. 
Since any psu can be selected on each of the n draws, a psu can appear more than 
once in the sample. 

Estimation is also simple in a with-replacement probability sample, and the 
estimators have the same form for either a one-stage or a multi-stage sample. The 
population total is estimated by $\hat{t}_\psi$ given in (6.5) and (6.13): 

$$\hat{t}_\psi = \frac{1}{n} \sum_{i \in R} \frac{\hat{t}_i}{\psi_i} = \sum_{i \in R} \sum_{j \in S_i} w_{ij} y_{ij},$$

where $R$ denotes the set of psus selected for the sample (including psus as many 
times as they are selected). If an SRS of $m_i$ of the $M_i$ ssus is taken at stage 2, then 
$w_{ij} = [1/(n\psi_i)](M_i/m_i)$. An unbiased estimator of $V(\hat{t}_\psi)$ is given by 

$$\hat{V}(\hat{t}_\psi) = \frac{1}{n}\,\frac{1}{n-1} \sum_{i \in R} \left( \frac{\hat{t}_i}{\psi_i} - \hat{t}_\psi \right)^2,$$

which is simply the sample variance of the values of $\hat{t}_i/\psi_i$ divided by $n$. If a psu 
appears more than once in the sample, each time a different probability subsample 
of ssus is selected for estimating $t_i$. In with-replacement sampling, the population 
mean $\bar{y}_U$ is estimated using (6.10) by $\hat{\bar{y}}_\psi = \hat{t}_\psi/\hat{M}_{0\psi}$, where $\hat{M}_{0\psi} = \sum_{i \in R} \sum_{j \in S_i} w_{ij}$. 
Equation (6.15) gives $\hat{V}(\hat{\bar{y}}_\psi)$ using ratio estimation methods. 
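
As a small numerical illustration of the with-replacement formulas, the sketch below computes the estimated total and its standard error for a one-stage sample in which the psu totals are observed exactly. The four (selection probability, psu total) pairs are hypothetical; one psu appears twice because it was drawn twice.

```python
import math

# Hypothetical one-stage pps sample of n = 4 psus drawn with replacement.
# Each tuple is (selection probability psi_i, psu total t_i); a psu drawn
# twice simply appears twice in the list.
sample = [(0.10, 42.0), (0.25, 90.0), (0.05, 23.0), (0.25, 90.0)]

n = len(sample)
u = [t / psi for psi, t in sample]             # values t_i / psi_i
t_hat = sum(u) / n                             # estimated population total
var_hat = sum((ui - t_hat) ** 2 for ui in u) / (n * (n - 1))
se = math.sqrt(var_hat)
print(f"estimated total = {t_hat:.1f}, SE = {se:.1f}")
```

The variance estimate is just the sample variance of the four values of t_i/psi_i divided by n, so no joint inclusion probabilities are needed.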

Although the estimators of means and totals in without-replacement unequal- 
probability sampling have simple form, variance estimation and sample selection 
methods can be complicated. We recommend using software such as SAS PROC 
SURVEYSELECT to select an unequal-probability sample without replacement when 
n/N is large. 

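With $\pi_i$ = P(psu $i$ is included in the sample, $S$), the Horvitz-Thompson estimator 
of the population total for a without-replacement sample is 

$$\hat{t}_{HT} = \sum_{i \in S} \frac{\hat{t}_i}{\pi_i},$$

where $\hat{t}_i$ is an unbiased estimator of the psu total $t_i$. The sampling weight for ssu $j$ of 
psu $i$ is 

$$w_{ij} = \frac{1}{\pi_i}\, \frac{1}{P(\text{ssu } j \text{ of psu } i \text{ is in sample} \mid \text{psu } i \text{ is in sample})};$$

in terms of the weights, 

$$\hat{t}_{HT} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}$$

and 

$$\hat{\bar{y}}_{HT} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}.$$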

The variance of the Horvitz-Thompson estimator is given in (6.20) and (6.21) for 
one-stage cluster sampling and in (6.26) and (6.27) for two-stage cluster sampling. 
The SYG unbiased estimator of $V(\hat{t}_{HT})$, given in (6.29), requires knowledge of the joint 
inclusion probabilities $\pi_{ik}$ = P(psus $i$ and $k$ are included in the sample) and is often 
difficult to compute. In many situations, we recommend using the with-replacement 
variance estimators in (6.30) and (6.34), which do not require knowledge of the $\pi_{ik}$’s; 
if n/N is small, a with-replacement variance estimator performs well. 
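
To make the roles of $\pi_i$ and $\pi_{ik}$ concrete, here is a minimal sketch for one-stage sampling with n = 2. The values in `psi` and `t` are hypothetical, and the design is assumed to be draw-by-draw selection (first psu with probability psi_i, second from the remaining psus with probability psi_k/(1 - psi_i)), chosen only because its joint inclusion probabilities are easy to write down. For a sample {i, k} the one-stage SYG estimate reduces to $[(\pi_i \pi_k - \pi_{ik})/\pi_{ik}](t_i/\pi_i - t_k/\pi_k)^2$. Enumerating all samples checks that the Horvitz-Thompson estimator and the SYG variance estimator are both unbiased for this design.

```python
from itertools import combinations

psi = [0.1, 0.2, 0.3, 0.4]      # hypothetical draw probabilities
t = [20.0, 25.0, 38.0, 24.0]    # hypothetical psu totals
N = len(psi)

# P(sample = {i, k}): psu i drawn first then k, or k first then i.
pi_joint = {
    (i, k): psi[i] * psi[k] / (1 - psi[i]) + psi[k] * psi[i] / (1 - psi[k])
    for i, k in combinations(range(N), 2)
}
# With n = 2, psu i is in the sample exactly when one of its pairs is drawn.
pi = [sum(p for pair, p in pi_joint.items() if i in pair) for i in range(N)]


def ht_total(i, k):
    """Horvitz-Thompson estimate of the total from the sample {i, k}."""
    return t[i] / pi[i] + t[k] / pi[k]


def syg_variance(i, k):
    """One-stage SYG variance estimate from the sample {i, k}."""
    weight = (pi[i] * pi[k] - pi_joint[i, k]) / pi_joint[i, k]
    return weight * (t[i] / pi[i] - t[k] / pi[k]) ** 2


total = sum(t)
expected_ht = sum(p * ht_total(i, k) for (i, k), p in pi_joint.items())
true_var = sum(p * (ht_total(i, k) - total) ** 2 for (i, k), p in pi_joint.items())
expected_syg = sum(p * syg_variance(i, k) for (i, k), p in pi_joint.items())
print(f"E[t_HT] = {expected_ht:.3f}, t = {total:.3f}")
print(f"V(t_HT) = {true_var:.3f}, E[SYG estimate] = {expected_syg:.3f}")
```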

Unequal-probability sampling is used in many large-scale government surveys to 
improve efficiency. It is also frequently used in telephone surveys and natural resource 
surveys. 


Key Terms 


Horvitz-Thompson estimator: The Horvitz-Thompson estimator of a population 
total $t$ is $\hat{t}_{HT} = \sum_{i \in S} \hat{t}_i/\pi_i$. This is the most general form of the estimator of $t$ in 
without-replacement samples with inclusion probabilities $\pi_i$. 


Inclusion probability: $\pi_i$ is the probability that psu $i$ is included in the sample. 


Joint inclusion probability: $\pi_{ik}$ is the probability that psus $i$ and $k$ are both included 
in the sample. 


Poisson sampling: A sampling process in which independent Bernoulli trials deter- 
mine whether each unit in the population is to be included in the sample. 


Probability proportional to size (pps) sampling: Unequal-probability sampling 
method in which the probability of sampling a unit is proportional to the number of 
elements in the unit. 


Random digit dialing: A method used in telephone surveys in which a proba- 
bility sample of telephone numbers is selected from the set of possible telephone 
numbers. 
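
Poisson sampling, as defined above, is simple to simulate. The following is a minimal sketch under assumed inputs: the inclusion probabilities `pi` and unit totals `t` are hypothetical, and the Horvitz-Thompson total is computed only to illustrate how the sample would be used. Because each unit is included by its own independent trial, the realized sample size varies from sample to sample.

```python
import random


def poisson_sample(pi, rng):
    """Include unit i with probability pi[i], independently of the others."""
    return [i for i, p in enumerate(pi) if rng.random() < p]


pi = [0.2, 0.5, 0.8, 0.5]      # hypothetical inclusion probabilities
t = [4.0, 10.0, 20.0, 6.0]     # hypothetical unit totals
rng = random.Random(1)         # fixed seed so the run is reproducible

reps = 20000
sizes, estimates = [], []
for _ in range(reps):
    s = poisson_sample(pi, rng)
    sizes.append(len(s))
    estimates.append(sum(t[i] / pi[i] for i in s))

mean_size = sum(sizes) / reps
mean_estimate = sum(estimates) / reps
print(f"mean sample size = {mean_size:.2f} (sum of pi = {sum(pi):.2f})")
print(f"mean HT estimate = {mean_estimate:.2f} (true total = {sum(t):.2f})")
```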


6.9 Exercises 


For Further Reading 


Overton and Stehman (1995) give a clearly written overview and examples of 
unequal-probability sampling. Chapter 9 of Brewer (2002) discusses the Horvitz-Thompson 
estimator and methods for approximating its variance. Brewer and Hanif (1983) 
present more than 50 methods for drawing with- and without-replacement samples 
with unequal probabilities. Tillé (2006) presents algorithms for selecting unequal- 
probability samples. Tillé also describes how to select balanced samples, in which a 
sample is designed so that estimated population totals of auxiliary variables equal the 
true population totals of those variables. Programs in the R statistical programming 
language that will select balanced samples are given in Matei and Tillé (2005). 

Rao (2005) outlines the history of how practical problems have spurred devel- 
opment of survey methods, with an interesting section on the history of unequal- 
probability sampling. Hansen and Hurwitz (1943) develop the theory of pps sampling 
with replacement. Horvitz and Thompson (1952) extend the work of Hansen and 
Hurwitz to unequal-probability sampling without replacement. 


A. Introductory Exercises 


1. For each of the following situations, say what unit might be used as psu. Do you 
believe there would be a strong clustering effect? Would you sample psus with equal 
or unequal probabilities? 


a You want to estimate the percentage of patients of U.S. Air Force optometrists 
and ophthalmologists who wear contact lenses. 


b Human taeniasis is acquired by ingesting larvae of the pork tapeworm in inad- 
equately cooked pork. You have been asked to design a survey to estimate the 
percentage of inhabitants of a village who have taeniasis. A medical examination 
is required to diagnose the condition. 


c You wish to estimate the total number of cows and heifers on all Ontario dairy 
farms; in addition, you would like to find estimates of the birth rate and stillbirth 
rate. 


d You want to estimate the percentages of undergraduate students at U.S. universities 
who are registered to vote, and who are affiliated with each political party. 


e A fisheries agency is interested in the distribution of carapace width of snow crabs. 
A trap hauled from a fishing boat has a limit of 30 crabs. 


f You wish to conduct a customer satisfaction survey of persons who have taken 
guided bus tours of the Grand Canyon rim area. Tour groups range in size from 8 
to 44 persons. 


2. An investigator wants to take an unequal-probability sample of 10 of the 25 psus in 
the population listed below and in file exercise0602.dat, and wishes to sample units 
with replacement. 

psu   $\psi_i$      psu   $\psi_i$
 1    0.000110    14    0.014804 
 2    0.018556    15    0.005577 
 3    0.062998    16    0.070784 
 4    0.078216    17    0.069635 
 5    0.075245    18    0.034650 
 6    0.073983    19    0.069492 
 7    0.076580    20    0.036590 
 8    0.038981    21    0.033853 
 9    0.040772    22    0.016959 
10    0.022876    23    0.009066 
11    0.003721    24    0.021795 
12    0.024917    25    0.059186 
13    0.040654 


a Adapt the cumulative-size method to draw a sample of size 10 with replacement 
with probabilities $\psi_i$. Instead of randomly selecting integers between 1 and 
$M_0 = \sum_{i=1}^N M_i$, select 10 random numbers between 0 and 1. 


b Adapt Lahiri’s method to draw a sample of size 10 with replacement with 
probabilities $\psi_i$. 

3. For the supermarket example in Section 6.1, suppose that the $\psi_i$’s are the same, but 
that each store has $t_i = 75$. What is $E[\hat{t}_\psi]$? $V[\hat{t}_\psi]$? 


4. For the supermarket example in Section 6.1, suppose that the $\psi_i$’s are 7/16 for 
store A, and 3/16 for each of stores B, C, and D. Show that $\hat{t}_\psi$ is unbiased, and 
find its variance. Do you think that the sampling scheme with these $\psi_i$’s is a good 
one? 


5. Return to the supermarket example of Section 6.1. Now let’s select two supermarkets 
with replacement. List the 16 possible samples (A,A), (A,B), etc., and find the 
probability with which each sample would be selected. Calculate $\hat{t}_\psi$ for each sample. What 
is $E[\hat{t}_\psi]$? $V[\hat{t}_\psi]$? 


6. The file azcounties.dat gives data from the 2000 U.S. Census on population and 
housing unit counts for the counties in Arizona (excluding Maricopa County and 
Pima County, which are much larger than the other counties and would be placed in a 
separate stratum). For this exercise, suppose that year 2000 population ($M_i$) is known 
and you want to take a sample of counties to estimate the total number of housing 
units ($t = \sum_{i=1}^N t_i$). The file has the value of $t_i$ for every county so you can calculate 
the population total and variance. 


a Calculate the selection probabilities $\psi_i$ for a sample of size 1 with probability 
proportional to 2000 population. Find $\hat{t}_\psi$ for each possible sample, and calculate 
the theoretical variance $V(\hat{t}_\psi)$. 

b Repeat (a) for an equal probability sample of size 1. How do the variances compare? 
Why do you think one design is more efficient than the other? 


c Now take a with-replacement sample of size 3. Find $\hat{t}_\psi$ and $\hat{V}(\hat{t}_\psi)$ for your sample. 


7. For a simple random sample with replacement, with $\psi_i = 1/N$, show that (6.6) 
simplifies to 

$$\hat{V}(\hat{t}_\psi) = \frac{N^2}{n}\, \frac{1}{n-1} \sum_{i \in R} (t_i - \bar{t})^2,$$

where the sum is over all $n$ units in the sample (including units as many times as they 
appear in the sample). 


8. Let’s return to the situation in Exercise 6 of Chapter 2, in which we took an SRS to 
estimate the average and total numbers of refereed publications of faculty and research 
associates. Now, consider a pps sample of faculty. The 27 academic units range in 
size from 2 to 92. We used Lahiri’s method to choose 10 psus with probabilities 
proportional to size and with replacement, and took an SRS of four (or fewer, if 
$M_i < 4$) members from each psu. Note that academic unit 14 appears three times in 
the sample; each time it appears, a different subsample was collected. 


Academic 
Unit   $M_i$   $\psi_i$       $y_{ij}$
14     65    0.0805452   3, 0, 0, 4 
23     25    0.0309789   2, 1, 2, 0 
 9     48    0.0594796   0, 0, 1, 0 
14     65    0.0805452   2, 0, 1, 0 
16      2    0.0024783   2, 0 
 6     62    0.0768278   0: 2.235 
14     65    0.0805452   1, 0, 0, 3 
19     62    0.0768278   4, 1, 0, 0 
21     61    0.0755886   23231 
11     41    0.0508055   25951253 


Find the estimated total number of publications, along with its standard error. 


B. Working with Survey Data 


9. The file statepps.dat lists the number of counties, land area, and 1992 population for 
the 50 states plus the District of Columbia. 

a Use the cumulative-size method to draw a sample of size 10 with replacement, 
with probabilities proportional to land area. What is $\psi_i$ for each state in your 
sample? 

b Use the cumulative-size method to draw a sample of size 10 with replacement, 
with probabilities proportional to population. What is $\psi_i$ for each state in your 
sample? 

c How do the two samples differ? Which states tend to be in each sample? 

10. Use your sample of states drawn with probability proportional to population, from 
Exercise 9, for this problem. 


a Using the sample, estimate the total number of counties in the United States, and 
find the standard error of your estimate. How does your estimate compare with 
the true value of total number of counties (which you can calculate, since the file 
statepps.dat contains the data for the whole population)? 


b Now suppose that your friend Tom finds the ten values of numbers of counties in 
your sample, but does not know that you selected these states with probabilities 
proportional to population. Tom then estimates the total number of counties using 
formulas for an SRS. What values for the estimated total and its standard error are 
calculated by Tom? How do these values differ from yours? Is Tom’s estimator 
unbiased for the population total? 


11. In Example 2.5, we took an SRS to estimate the total acreage devoted to farming 
in the United States in 1992. Now, use the sample of states drawn with probability 
proportional to land area in Exercise 9, and then subsample five counties randomly 
from each state using file agpop.dat. Estimate the total acreage devoted to farming in 
1992, along with its standard error. 


12. The file statepop.dat, used in Example 6.5, also contains information on total number 
of farms, number of veterans, and other items. 

a Plot the total number of farms versus the probabilities of selection $\psi_i$. Does your 
plot indicate that unequal-probability sampling will be helpful here? 


b Estimate the total number of farms in the United States, along with its standard 
error. 


13. Use the file statepop.dat for this problem. 


a Plot the total number of veterans versus the probabilities of selection $\psi_i$. Does 
your plot indicate that unequal-probability sampling will be helpful here? 


b Estimate the total number of veterans in the United States, and find the standard 
error for your estimate. 


c Estimate the total number of Vietnam veterans in the United States, and find the 
standard error for your estimate. 


14. In Example 6.11, we calculated the with-replacement variance for $\hat{t}_{HT}$. In this example, 
the sampling fraction n/N is 1/3, so the with-replacement variance is likely to 
overestimate the without-replacement variance. The joint inclusion probabilities for 
the psus are given in file classppsjp.dat, and can also be obtained by running the SAS 
program given on the website. 


a Calculate $\hat{V}_{HT}(\hat{t}_{HT})$ and $\hat{V}_{SYG}(\hat{t}_{HT})$ for this data set. 


b SAS software approximates the without-replacement variance in unequal-probability 
sampling using 

$$\left( 1 - \frac{n}{N} \right) \hat{V}_{WR}(\hat{t}_{HT}).$$

Calculate this approximation for the class data. 


c How do these estimates compare, and how do they compare with the with-replacement 
variance calculated in Example 6.11? 

C. Working with Theory 
All of the problems in this section require knowledge of probability. 


15. a Prove that Lahiri’s method results in a probability proportional to size sample 
with replacement. HINT: Let $J$ be an integer with $J \ge \max\{M_i\}$. Let $U_1, U_2, \ldots$ be 
discrete uniform $\{1, \ldots, N\}$ random variables, let $V_1, V_2, \ldots$ be discrete uniform 
$\{1, \ldots, J\}$ random variables, and assume all $U_i$ and $V_j$ are independent. To select 
the first psu, we generate pairs $(U_1, V_1), (U_2, V_2), \ldots$ until $V_j \le M_{U_j}$. 

b Suppose the population has $N$ psus, with sizes $M_1, M_2, \ldots, M_N$. Let $X$ represent 
the number of pairs of random numbers that must be generated to obtain a sample 
of size $n$. Find $E[X]$. 
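
The accept-reject step in the hint can be sketched in a few lines. This is an illustration under assumed inputs, not a proof: the psu sizes in `sizes` are hypothetical, psus are indexed from 0, and J is taken to be the largest size. The simulation compares the observed selection frequencies with the target proportions $M_i/M_0$.

```python
import random
from collections import Counter


def lahiri_draw(sizes, rng):
    """Draw one psu index with probability proportional to sizes[i]."""
    n_psus, j_max = len(sizes), max(sizes)
    while True:
        u = rng.randrange(n_psus)     # candidate psu, uniform on 0..N-1
        v = rng.randint(1, j_max)     # uniform on 1..J
        if v <= sizes[u]:             # accept the candidate, else try again
            return u


sizes = [2, 5, 13, 30]                # hypothetical psu sizes M_i
rng = random.Random(2024)             # fixed seed so the run is reproducible
draws = 40000
counts = Counter(lahiri_draw(sizes, rng) for _ in range(draws))

m0 = sum(sizes)
freq = [counts[i] / draws for i in range(len(sizes))]
for i, m in enumerate(sizes):
    print(f"psu {i}: target {m / m0:.3f}, observed {freq[i]:.3f}")
```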


16. Note that the random variables $Q_1, \ldots, Q_N$ in Section 6.3 have a joint multinomial 
distribution with probabilities $\psi_1, \psi_2, \ldots, \psi_N$. Use properties of the multinomial 
distribution to show that $\hat{t}_\psi$ in (6.13) is an unbiased estimator of $t$ with variance given by 

$$V(\hat{t}_\psi) = \frac{1}{n} \sum_{i=1}^N \psi_i \left( \frac{t_i}{\psi_i} - t \right)^2 + \frac{1}{n} \sum_{i=1}^N \frac{V(\hat{t}_i)}{\psi_i}. \qquad (6.46)$$

Also show that (6.14) is an unbiased estimator of the variance in (6.46). Hint: Use 
properties of conditional expectation in Appendix A, and write 

$$V(\hat{t}_\psi) = V(E[\hat{t}_\psi \mid Q_1, \ldots, Q_N]) + E(V[\hat{t}_\psi \mid Q_1, \ldots, Q_N]).$$


17. Show that (6.28) and (6.29) are equivalent when an SRS of psus is selected as in 
Chapter 5. Are they equal if psus are selected with unequal probabilities? 


18. Show that the formulas for stratified random sampling in (3.1), (3.3), and (3.5) 
follow from the formulas for the Horvitz-Thompson estimator in Section 6.4.3. For a 
stratified random sample, we sample from every stratum in the population. Thus, if 
we treat strata as if they were psus, $\pi_i = 1$ for every stratum in the population. 


19. Use the population in Exercise 2 of Chapter 4 for this exercise. Let $\psi_i$ be proportional 
to $x_i$. 

a Using the draw-by-draw method illustrated in Example 6.8, calculate $\pi_i$ for each 
unit and $\pi_{ik}$ for each pair of units, for a without-replacement sample of size two. 


b What is $V(\hat{t}_{HT})$? How does it compare with the with-replacement variance using 
(6.46)? 


20. Covariance of estimated population totals in a cluster sample. Suppose a one-stage 
cluster sample is taken from a population of $N$ psus, with inclusion probabilities $\pi_i$. 
Let $t_x$ and $t_y$ be the population totals for response variables $x$ and $y$, and let $t_{ix}$ and $t_{iy}$ 
be the totals of variables $x$ and $y$ in psu $i$. 


a Show that 

$$\mathrm{Cov}(\hat{t}_x, \hat{t}_y) = \sum_{i=1}^N \frac{1 - \pi_i}{\pi_i}\, t_{ix} t_{iy} + \sum_{i=1}^N \sum_{k \ne i} \frac{\pi_{ik} - \pi_i \pi_k}{\pi_i \pi_k}\, t_{ix} t_{ky}.$$

b If an SRS of $n$ of the $N$ psus is selected, with $\pi_i = n/N$ and $\pi_{ik} = (n/N)[(n - 1)/(N - 1)]$, 
show using part (a) that 

$$\mathrm{Cov}(\hat{t}_x, \hat{t}_y) = \frac{N^2}{(N - 1)n} \left( 1 - \frac{n}{N} \right) \left[ \sum_{i=1}^N t_{ix} t_{iy} - \frac{t_x t_y}{N} \right].$$


21. Comparing two domain means in a cluster sample. In Exercise 24 of Chapter 4, you 
showed that in an SRS where $\bar{y}_1$ and $\bar{y}_2$ estimate respective population domain means 
$\bar{y}_{U1}$ and $\bar{y}_{U2}$, $V(\bar{y}_1 - \bar{y}_2) \approx V(\bar{y}_1) + V(\bar{y}_2)$ because $\mathrm{Cov}(\bar{y}_1, \bar{y}_2) \approx 0$. Now let’s explore 
what happens when a one-stage cluster sample is selected from a population of $N$ 
psus. For simplicity, assume that each psu has $M$ ssus and that an SRS of $n$ psus is 
selected. Let $\hat{\bar{y}}_1$ and $\hat{\bar{y}}_2$ be the estimators of the domain means from the cluster sample. 
Similarly to Exercise 24 of Chapter 4, let $x_{ij} = 1$ if ssu $j$ of psu $i$ is in domain 1 and 
$x_{ij} = 0$ if ssu $j$ of psu $i$ is in domain 2, and let $u_{ij} = x_{ij} y_{ij}$. 

a Find 


Hint: Use Exercise 20. 


b Show that the covariance in (a) is 0 if for each psu, all of the elements in that psu 
belong to the same domain; that is, either $t_{ix} = 0$ or $t_{ix} = M$ for each psu $i$. [If 
the covariance in (a) is 0, then (4.26) implies that $\mathrm{Cov}(\hat{\bar{y}}_1, \hat{\bar{y}}_2) \approx 0$.] 


c Give an example in which the covariance in (a) is not 0. 


22. Indirect sampling. Suppose you want to take a sample of students in a university but 
your sampling frame is a list of all classes offered by the university. A student may be 
in more than one class, so a probability sample of classes, which includes all students 
in those classes, may contain some students multiple times. Lavallée (2007) describes 
a generalized weight share method for such situations, and this exercise is adapted 
from results in his book. 

Let $U^A$ be the sampling frame population with $N$ units. Let $Z_i = 1$ if unit $i$ is in 
the sample $S^A$ and 0 otherwise, with $\pi_i = P(Z_i = 1)$. The target population $U^B$ has 
$M$ elements. Each element in $U^B$ is linked with one or more of the units in $U^A$; let 

$$\ell_{ik} = \begin{cases} 1 & \text{if element } k \text{ from } U^B \text{ is linked to unit } i \text{ from } U^A \\ 0 & \text{otherwise} \end{cases}$$

and let $L_k = \sum_{i=1}^N \ell_{ik}$. We assume $L_k \ge 1$ for each $k$ and that $L_k$ is known. In our 
example, $\ell_{ik} = 1$ if student $k$ is in class $i$ and $L_k$ is the number of classes taken by 
student $k$. Let $y_k$ be a characteristic associated with element $k$ of $U^B$. We want to 
estimate $t_y = \sum_{k=1}^M y_k$. 


a Let $u_i = \sum_{k=1}^M \ell_{ik} y_k / L_k$ and let 

$$\hat{t}_y = \sum_{i=1}^N \frac{Z_i}{\pi_i}\, u_i.$$

Show that $\hat{t}_y$ is an unbiased estimator of $t_y$, with 

$$V(\hat{t}_y) = \sum_{i=1}^N \frac{1 - \pi_i}{\pi_i}\, u_i^2 + \sum_{i=1}^N \sum_{j \ne i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j}\, u_i u_j.$$


b Let $S^B$ be the set of distinct units sampled from $U^B$ using this procedure. Show 
that $\hat{t}_y$ can be rewritten as 

$$\hat{t}_y = \sum_{k \in S^B} \frac{1}{L_k} \sum_{i=1}^N \frac{Z_i}{\pi_i}\, \ell_{ik}\, y_k.$$

We can view $w_k = \frac{1}{L_k} \sum_{i=1}^N \frac{Z_i}{\pi_i} \ell_{ik}$ as a “weight” for $y_k$. 

c If $L_k = 1$ for all $k$, show that $\hat{t}_{HT}$ in (6.19) is a special case of $\hat{t}_y$. What is $w_k$ in 
this case? 


d Suppose $U^A = \{1, 2, 3\}$, $U^B = \{1, 2\}$ and the values of $\ell_{ik}$ are given in the following 
table: 


                       Element k from $U^B$ 
$\ell_{ik}$                 1    2 
Unit i from $U^A$    1      1    0 
                     2      1    1 
                     3      0    1 


Suppose $y_1 = 4$ and $y_2 = 6$, so that $t_y = 10$. Find the value of $\hat{t}_y$ for each of the 
three possible SRSs of size 2 from $U^A$. Using the sampling distribution of $\hat{t}_y$, show 
that $\hat{t}_y$ is unbiased but that $V(\hat{t}_y) > 0$. Even though each possible SRS from $U^A$ 
contains both units from $U^B$ (so in effect, a census is taken of $U^B$), the variance 
of $\hat{t}_y$ is not zero. 


e Data file wtshare.dat contains information from a hypothetical SRS of size $n = 100$ 
from a population of $N = 40{,}000$ adults. Each adult in the sample is asked 
about his or her children: how many children between ages 0 and 5, whether those 
children attend preschool, and how many other adults in the population claim 
the child as part of their household. Estimate the total number of children in the 
population who attend preschool. Use the with-replacement variance estimator 

$$\frac{1}{n}\, \frac{1}{n - 1} \sum_{i \in S^A} \left( \frac{n u_i}{\pi_i} - \hat{t}_y \right)^2$$

to construct an approximate 95% CI for the total number of children who attend 
preschool. 


23. In simple random sampling, we know that a without-replacement sample of size $n$ 
has smaller variance than a with-replacement sample of size $n$. The same result is not 
always true for unequal-probability sampling designs (Raj, 1968, p. 56). Consider a 
with-replacement design with selection probabilities $\psi_i$, and a corresponding 
without-replacement design with inclusion probabilities $\pi_i = n\psi_i$; assume $n\psi_i < 1$ for 
$i = 1, \ldots, N$. 




a Consider a population with $N = 4$ and $t_1 = -5$, $t_2 = 6$, $t_3 = 0$, and $t_4 = -1$. 
The joint inclusion probabilities for a without-replacement sample of size 2 are 
$\pi_{12} = 0.004$, $\pi_{13} = \pi_{23} = \pi_{24} = 0.123$, $\pi_{14} = 0.373$, and $\pi_{34} = 0.254$. Find the 
value of $\pi_i$ for each unit. Show that for this design and population, $V(\hat{t}_\psi) < V(\hat{t}_{HT})$. 


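b Show that for $\pi_i = n\psi_i$ and $V(\hat{t}_\psi)$ in (6.8), 

$$V(\hat{t}_\psi) = \frac{1}{2n} \sum_{i=1}^N \sum_{k=1}^N \pi_i \pi_k \left( \frac{t_i}{\pi_i} - \frac{t_k}{\pi_k} \right)^2.$$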


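c Using $V(\hat{t}_{HT})$ in (6.21), show that if 

$$\pi_{ik} \ge \frac{n - 1}{n}\, \pi_i \pi_k \quad \text{for all } i \text{ and } k,$$

then $V(\hat{t}_{HT}) \le V(\hat{t}_\psi)$. 

d Gabler (1984) shows that if 

$$\sum_{i=1}^N \min_k \left\{ \frac{\pi_{ik}}{\pi_k} \right\} \ge n - 1,$$

then $V(\hat{t}_{HT}) \le V(\hat{t}_\psi)$. Show that if $\pi_{ik} \ge (n - 1)\pi_i \pi_k / n$ for all $i$ and $k$, then 
Gabler’s condition is met. 


e (Requires knowledge of linear algebra.) Show that if $V(\hat{t}_{HT}) \le V(\hat{t}_\psi)$, then 

$$\pi_{ik} \le 2\, \frac{n - 1}{n}\, \pi_i \pi_k \quad \text{for all } i \text{ and } k.$$

Hint: Use the results in Theorem 6.1 to simplify $V(\hat{t}_\psi) - V(\hat{t}_{HT})$ so that it may 
be written as $\sum_{i=1}^N \sum_{k=1}^N a_{ik} t_i t_k$. Then $A$, the matrix with elements $a_{ik}$, must be 
nonnegative definite and therefore all principal 2 x 2 submatrices must have 
determinant $\ge 0$. 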


24. Consider a without-replacement sample of size 2 from a population of size 4, with 
joint inclusion probabilities $\pi_{12} = \pi_{34} = 0.31$, $\pi_{13} = 0.20$, $\pi_{14} = 0.14$, $\pi_{23} = 0.03$, 
and $\pi_{24} = 0.01$. 

a Calculate the inclusion probabilities $\pi_i$ for this design. 


b Suppose $t_1 = 2.5$, $t_2 = 2.0$, $t_3 = 1.1$, and $t_4 = 0.5$. Find $\hat{V}_{HT}(\hat{t}_{HT})$ and $\hat{V}_{SYG}(\hat{t}_{HT})$ 
for each possible sample. 


25. Brewer’s (1963, 1975) procedure for without-replacement unequal-probability sampling. 
For a sample of size $n = 2$, let $\pi_i$ be the desired probability of inclusion for 
psu $i$, with the usual constraint that $\sum_{i=1}^N \pi_i = n$. Let $\psi_i = \pi_i/2$ and 

$$a_i = \frac{\psi_i (1 - \psi_i)}{1 - \pi_i}.$$

Draw the first psu with probability $a_i / \sum_{k=1}^N a_k$ of selecting psu $i$. Supposing psu $i$ is 
selected at the first draw, select the second psu from the remaining $N - 1$ psus with 
probabilities $\psi_k/(1 - \psi_i)$. 

a Show that 

$$\pi_{ik} = \frac{\psi_i \psi_k}{\sum_{j=1}^N a_j} \left( \frac{1}{1 - \pi_i} + \frac{1}{1 - \pi_k} \right).$$

b Show that P(psu $i$ selected in sample) = $\pi_i$. HINT: First show that 

$$2 \sum_{k=1}^N a_k = 1 + \sum_{k=1}^N \frac{\psi_k}{1 - \pi_k}.$$


c The SYG estimator of $V(\hat{t}_{HT})$ for one-stage sampling is given in (6.23). Show that 
$\pi_i \pi_k - \pi_{ik} \ge 0$ for Brewer’s method, so that the SYG estimator of the variance is 
always nonnegative. 
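
Brewer's two draws can be enumerated exactly for a small population. The sketch below is a numerical check rather than a proof, and its inputs are assumptions: the psu sizes in `sizes` are hypothetical and the desired inclusion probabilities are taken proportional to size. It computes the probability of every pair from the two-draw description and confirms that the achieved inclusion probabilities match the desired ones.

```python
def brewer_joint_probabilities(pi):
    """Return {(i, k): P(sample = {i, k})} for Brewer's n = 2 procedure."""
    n_psus = len(pi)
    psi = [p / 2 for p in pi]
    a = [psi[i] * (1 - psi[i]) / (1 - pi[i]) for i in range(n_psus)]
    a_sum = sum(a)
    joint = {}
    for i in range(n_psus):
        for k in range(i + 1, n_psus):
            i_then_k = a[i] / a_sum * psi[k] / (1 - psi[i])
            k_then_i = a[k] / a_sum * psi[i] / (1 - psi[k])
            joint[i, k] = i_then_k + k_then_i
    return joint


sizes = [10, 20, 30, 40]                         # hypothetical psu sizes
target = [2 * m / sum(sizes) for m in sizes]     # desired pi_i, summing to 2
joint = brewer_joint_probabilities(target)
achieved = [sum(p for pair, p in joint.items() if i in pair)
            for i in range(len(sizes))]
for want, got in zip(target, achieved):
    print(f"desired pi = {want:.3f}, achieved pi = {got:.3f}")
```

The procedure requires every desired inclusion probability to be strictly less than 1, since $1 - \pi_i$ appears in a denominator.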


26. The following table gives population values for a small population of clusters: 


psu, i   $M_i$   Values, $y_{ij}$           $t_i$
1        5     3, 5, 4, 6, 2            20 
2        4     7, 4, 7, 7               25 
3        8     7, 2, 9, 4, 5, 3, 2, 6   38 
4        5     2, 5, 3, 6, 8            24 
5        3     9, 7, 5                  21 


You wish to select two psus without replacement with probabilities of inclusion 
proportional to $M_i$. Using Brewer’s method from Exercise 25, construct a table of $\pi_{ik}$ 
for the possible samples. What is the variance of the one-stage Horvitz-Thompson 
estimator? 


27. Rao (1963) discusses the following rejective method for selecting a pps sample 
without replacement: Select $n$ psus with probabilities $\psi_i$ and with replacement. If any psu 
appears more than once in the sample, reject the whole sample and select another $n$ 
psus with replacement. Repeat until you obtain a sample of $n$ psus with no duplicates. 
Find $\pi_{ik}$ and $\pi_i$ for this procedure, for $n = 2$. 


28. The Rao-Hartley-Cochran (1962) method for selecting psus with unequal probabilities. 
To take a sample of size $n$, divide the population into $n$ random groups of psus, 
$U_1, U_2, \ldots, U_n$. Then select one psu from each group with probability proportional 
to size. Let $N_k$ be the number of psus in group $k$. If psu $i$ is in group $k$, it is selected 
with probability $x_{ki} = M_i / \sum_{j \in U_k} M_j$; the estimator is 

$$\hat{t}_{RHC} = \sum_{k=1}^n \sum_{i=1}^N I_{ki} Z_i \frac{t_i}{x_{ki}},$$

written in terms of the indicator variables $I_{ki}$ and $Z_i$ defined in the hint. 
Show that $\hat{t}_{RHC}$ is unbiased for $t$, and find its variance. HINT: Use two sets of indicator 
variables. Let $I_{ki} = 1$ if psu $i$ is in group $k$ and 0 otherwise, and let $Z_i = 1$ if psu $i$ is 
selected to be in the sample. 
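
The grouping and within-group selection steps can be simulated directly. The following is a rough sketch under assumptions not in the exercise: the sizes and psu totals are hypothetical, the random groups are formed by shuffling the psus and dealing them into n groups of nearly equal size, and the estimate adds $t_i/x_{ki}$ over the selected psus. Averaging many simulated estimates gives an empirical check of unbiasedness, not a proof.

```python
import random


def rhc_estimate(t, sizes, n, rng):
    """Draw one Rao-Hartley-Cochran sample of n psus and estimate the total."""
    order = list(range(len(t)))
    rng.shuffle(order)
    groups = [order[k::n] for k in range(n)]     # n random groups (needs n <= N)
    estimate = 0.0
    for group in groups:
        group_sizes = [sizes[i] for i in group]
        i = rng.choices(group, weights=group_sizes, k=1)[0]
        x_ki = sizes[i] / sum(group_sizes)       # selection prob. within group
        estimate += t[i] / x_ki
    return estimate


sizes = [3, 6, 9, 12, 15, 15]                    # hypothetical M_i
t = [10.0, 19.0, 30.0, 35.0, 47.0, 44.0]         # hypothetical psu totals
rng = random.Random(7)                           # fixed seed, reproducible run
reps = 20000
mean_estimate = sum(rhc_estimate(t, sizes, 2, rng) for _ in range(reps)) / reps
print(f"mean of estimates = {mean_estimate:.2f}, true total = {sum(t):.2f}")
```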


29. The estimators of $V(\hat{t}_{HT})$ in (6.22) and (6.23) require knowledge of the joint inclusion 
probabilities $\pi_{ik}$. To use these formulas, the data file must contain an $n \times n$ matrix 
of the $\pi_{ik}$’s, which can dramatically increase the size of the data file; in addition, 
computing the variance estimator is complicated. If the joint inclusion probabilities 
$\pi_{ik}$ could be approximated as a function of the $\pi_i$’s, estimation would be simplified. 
Let $c_i = \pi_i(1 - \pi_i)$. Hájek (1964) (see Berger, 2004, for extensions) suggested 
approximating $\pi_{ik}$ by 

$$\tilde{\pi}_{ik} = \pi_i \pi_k \left[ 1 - \frac{(1 - \pi_i)(1 - \pi_k)}{\sum_{j=1}^N c_j} \right].$$

a Does the set of $\tilde{\pi}_{ik}$’s satisfy condition (6.18)? Can they be joint inclusion 
probabilities? 

b What is $\tilde{\pi}_{ik}$ if an SRS is taken? Show that if $N$ is large, $\tilde{\pi}_{ik}$ is close to $\pi_{ik}$. 


c Show that if $\tilde{\pi}_{ik}$ is substituted for $\pi_{ik}$ in (6.21), the expression for the variance can 
be written as 

$$V_{Haj}(\hat{t}_{HT}) = \sum_{i=1}^N c_i e_i^2,$$

where $e_i = t_i/\pi_i - A$ and 

$$A = \frac{\sum_{j=1}^N c_j t_j/\pi_j}{\sum_{j=1}^N c_j}.$$

HINT: Write (6.21) as 


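d We can estimate $V_{Haj}(\hat{t}_{HT})$ by 

$$\hat{V}_{Haj}(\hat{t}_{HT}) = \sum_{i \in S} \hat{c}_i \hat{e}_i^2,$$

where $\hat{c}_i = (1 - \pi_i)\, n/(n - 1)$, $\hat{e}_i = t_i/\pi_i - \hat{A}$, and $\hat{A} = \sum_{j \in S} \hat{c}_j (t_j/\pi_j) / \sum_{j \in S} \hat{c}_j$. Show 
that if an SRS of size $n$ is taken, then $\hat{V}_{Haj}(\hat{t}_{HT}) = N^2 (1 - n/N) s_t^2 / n$. 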


30. This exercise is based on results in Brewer and Donadio (2003). 


a Show, using the results in Theorem 6.1, that the variance in (6.21) can be rewritten 
as: 

$$V(\hat{t}_{HT}) = \sum_{i=1}^N \pi_i \left( \frac{t_i}{\pi_i} - \frac{t}{n} \right)^2 - \sum_{i=1}^N \pi_i^2 \left( \frac{t_i}{\pi_i} - \frac{t}{n} \right)^2 + \sum_{i=1}^N \sum_{k \ne i} (\pi_{ik} - \pi_i \pi_k) \left( \frac{t_i}{\pi_i} - \frac{t}{n} \right) \left( \frac{t_k}{\pi_k} - \frac{t}{n} \right). \qquad (6.47)$$

HINT: Write $t_i/\pi_i - t_k/\pi_k = t_i/\pi_i - t/n + t/n - t_k/\pi_k$. 

b The first term in (6.47) is the variance that would result if a with-replacement 
sample with selection probabilities $\psi_i = \pi_i/n$ were taken. Brewer and Donadio 
(2003) suggest that the second term may be viewed as a finite population correction 
for unequal-probability sampling, so that the first two terms in (6.47) approximate 
$V(\hat{t}_{HT})$ without depending on the joint inclusion probabilities $\pi_{ik}$. Calculate the 
three terms in (6.47) for an SRS of size $n$. 


c Suppose that there exist constants $c_i$ such that $\pi_{ik} \approx \pi_i \pi_k (c_i + c_k)/2$. Show that 
with this substitution, the third term in (6.47) can be approximated by 

$$\sum_{i=1}^N \pi_i^2 (1 - c_i) \left( \frac{t_i}{\pi_i} - \frac{t}{n} \right)^2,$$

so that 

$$V(\hat{t}_{HT}) \approx \sum_{i=1}^N \pi_i (1 - c_i \pi_i) \left( \frac{t_i}{\pi_i} - \frac{t}{n} \right)^2. \qquad (6.48)$$

Two choices suggested for $c_i$ are $c_i = (n - 1)/(n - \pi_i)$ or (following Hartley and 
Rao, 1962), 

$$c_i = \frac{n - 1}{n - 2\pi_i + \frac{1}{n} \sum_{k=1}^N \pi_k^2}.$$

Calculate the variance approximation in (6.48) for an SRS with each of these 
choices of $c_i$. 

31. (Requires calculus.) Suppose in (6.46) that the variance of the estimator of the total in 
psu $i$ is $V(\hat{t}_i) = M_i^2 S_i^2 / m_i$. If you can only subsample $C = \sum_{i=1}^N m_i$ ssus, what values 
of $m_i$ minimize (6.46)? 


32. Consider the Mitofsky-Waksberg method, discussed in Example 6.12. Show that the 
probability that psu $i$ is selected as the first psu in the sample is 

$$P(\text{select psu } i) = \frac{M_i}{M_0}.$$

Hint: See Exercise 15 and argue that the Mitofsky-Waksberg method for selecting 
psus is a special case of Lahiri’s method. 


33. In Example 6.12 and Exercise 32, it was shown that the Mitofsky-Waksberg method 
produces a self-weighting sample if any psu in the sample has at least k residential 
telephone numbers. Suppose a psu in the sample has x < k residential numbers. What 
is the relative weight for a telephone number in that psu? 


34. One drawback of the Mitofsky-Waksberg method as described in Example 6.12 is 
that the sequential sampling procedure of selecting numbers in the psu until one has 
a total of $k$ residential numbers can be cumbersome to implement. Suppose in the 
second stage you dial an additional $(k - 1)$ numbers whether they are residential or 
not, and let $x$ be the number of residential lines among the $(k - 1)$. What are the 
relative weights for the residential telephone numbers? 


35. The Mitofsky-Waksberg method, described in Example 6.12, gives a self-weighting 
sample of telephone numbers under ideal circumstances. Does it give a self-weighting 
sample of adults? Why or why not? If not, what relative weights should be used? 


36. Suppose a three-stage cluster sample is taken from a population with $N$ psus, $M_i$ ssus 
in the $i$th psu, and $L_{ij}$ tsus (tertiary sampling units) in the $j$th ssu of the $i$th psu. To 
draw the sample, $n$ psus are randomly selected, then $m_i$ ssus from the selected psus, 
then $l_{ij}$ tsus from the selected ssus. 

a Show that the sample weights are 

$$w_{ijk} = \frac{N}{n}\, \frac{M_i}{m_i}\, \frac{L_{ij}}{l_{ij}}.$$

b Let $\hat{t} = \sum_{i \in S} \sum_{j \in S_i} \sum_{k \in S_{ij}} w_{ijk} y_{ijk}$. Show that $E[\hat{t}] = t = \sum_{i=1}^N \sum_{j=1}^{M_i} \sum_{k=1}^{L_{ij}} y_{ijk}$. 


c Using the properties of conditional expectation in Section A.4, find an expression 
for $V(\hat{t})$. 


37. (Model-based.) Suppose the entire population is observed in the sample, so that $n = N$ 
and $m_i = M_i$. Examine the three estimators $\hat{T}_{unb}$, $\hat{T}_{ratio}$ (from Section 5.6) and $\hat{T}_P$ 
(from Section 6.7). If the entire population is observed, which of these estimators 
equal $T = \sum_{i=1}^N \sum_{j=1}^{M_i} Y_{ij}$? 

D. Projects and Activities 


38. Rectangles. Use the population of rectangles in Exercise 30 of Chapter 2 for the 
exercise. The file rectlength.dat contains information on the vertical length of each of 
the 100 rectangles in the population. 

a Select a sample of 10 rectangles with replacement from the 100 rectangles, with 
probability proportional to the length of the rectangle. 

b For your sample, plot $t_i$, the area of the rectangle, vs. the selection probability $\psi_i$. 
What is the correlation between $t_i$ and $\psi_i$? 

c Estimate the total area of all 100 rectangles, and find a 95% confidence interval for 
the total area. Compare your answers with the estimate and confidence interval 
from the SRS in Exercise 30 of Chapter 2. Did unequal-probability sampling 
result in a smaller variance estimate? 


39. Repeat Exercise 38(a), using a without-replacement sample of 10 rectangles selected 
with probability proportional to the length of the rectangle. You will need to use a 
program such as SAS PROC SURVEYSELECT to select the sample. 

a What are the inclusion probabilities $\pi_i$ for the rectangles in your sample? 

b Estimate the total area of all 100 rectangles using the Horvitz-Thompson estimator 
$\hat{t}_{HT}$. 

c Find the with-replacement variance estimate for $\hat{t}_{HT}$. 

d (Requires knowledge of the joint inclusion probabilities.) Find the SYG variance 
estimate for $\hat{t}_{HT}$. How does this compare with the estimate in (c)? 


Historians wanting to use data from United States Censuses collected in the pre- 
computer age faced the daunting task of poring over reels of handwritten records 
on microfilm, arranged in geographical order. The Public Use Microdata Samples
(PUMS) were constructed by taking samples of the records and typing those records 
into the computer. Ruggles (1995, p. 44) described the PUMS construction for the 
1940 Census: 


The population schedules of the 1940 census are preserved on 4,576 microfilm reels. 
Each census page contains information on forty individuals. Two lines on each page 
were designated as “sample lines” by the Census Bureau: the individuals falling on 
those lines—5 percent of the population—were asked a set of supplemental questions
that appear at the bottom of the census page. 

Two of every five census pages were systematically selected for examination. On 
each selected census page, one of the two designated sample lines was then randomly 
selected. Data-entry personnel then counted the size of the sample unit containing the 
targeted sample line. Units size six or smaller were included in the sample in inverse 
proportion to their size. Thus, every one-person unit was included in the sample, every 
second two-person unit, every third three-person unit, and so on. Units with seven or 
more persons were included with a probability of 1-in-7: every seventh household of 
size seven or more was selected for the sample. 


a Explain why this is a cluster sample. What are the psus? The ssus?


b What effect do you think the clustering will have on estimates of race? age? 
occupation? 


c Construct a table for the inclusion probabilities for persons in one-person units,
two-person units, and so on. 


d What happens if you estimate the mean age of the population by the average age 
of all persons in the sample? What estimator should you use? 


e Do you think that taking a systematic sample was a good idea for this sample? 
Why or why not? 


f Does this method provide a representative sample of households? Why or why not?


g What type of sample is taken of the individuals with supplementary information? 
Explain. 


Ruggles (1995, p. 45) also describes the 1950 PUMS: 


The 1950 census schedules are contained on 6,278 microfilm reels. Each census page 
contains information on thirty individuals. Every fifth line on the census page was 
designated as a sample line, and additional questions for the sample-line individuals 
appear at the bottom of the form. For the last sample-line individual on each page, there 
was a block of additional supplemental questions. Thus, 20 percent of individuals were 
asked a basic set of supplemental questions, and 3.33 percent of individuals were asked 
a full set of supplemental questions. 

One-in-eleven pages within enumeration districts was selected randomly. On each 
selected census page, the sixth sample-line individual (the one with the full set of 
questions) was selected for inclusion in the sample. Any other members of the sample 
unit containing the selected individual were also included. 


Answer the same questions from Exercise 40 for the 1950 PUMS. 


In Exercise 35 of Chapter 2, you estimated the size of an audience by taking an SRS. 
Explain how this is a special case of cluster sampling. Obtain a seating chart for an 
auditorium in which the rows have different numbers of seats. Using the seating chart,
select an unequal-probability sample of 10 or 20 rows, with probabilities proportional 
to the numbers of seats in each row. Why might you expect the unequal-probability 
sample to have a smaller variance for the estimated audience size than an SRS of the 
same number of rows? 

Estimate the audience size for this auditorium using your unequal-probability 
sample; count the number of people in each selected row. Give a 95% CI for the total 
number of people in the auditorium. 


Create your own stock market index fund. The data file sp500.dat contains a listing 
of the stocks in the S&P 500 Index, along with the market capitalization of each 
company, as of April 2006. The market capitalization of a company is the market 
value of its outstanding shares, calculated as (price per share) x (number of shares 
outstanding). 

There are several ways you could own a self-weighting sample of dollars repre- 
sented by all the companies in this index. You could take an SRS of the individual 
dollars in the stock market, buying shares in each company for which you have at 
least one dollar in your SRS. This can be cumbersome, however, and would mean 
buying shares of a large (and random) number of companies. 

An easier way is to take a sample of companies with probability proportional to 
market capitalization. Suppose you have $300,000 to invest. Select a sample of 30 
companies from the list of 498 companies in the file with probability proportional 
to market capitalization. Create a file of the companies in your sample; for each 
company, state how much money you will invest in that company so that you have a 
self-weighting sample of dollars in the index. 


Baseball data. 


a Use the population in the file baseball.dat to take a two-stage cluster sample (without
replacement) with the teams as the psus, with probabilities proportional to the
total number of runs scored for the teams. Your sample should have approximately 
150 players altogether, as in the SRS from Exercise 32 of Chapter 2. Describe 
how you selected your sample. 

b Construct the sampling weights for your sample. 

c Let $\hat{t}_i$ be the estimated total of the variable logsal for team i in your sample, and
let $\pi_i$ be the inclusion probability for team i. Plot $\hat{t}_i$ vs. $\pi_i$.

d Use your sample to estimate the mean of the variable logsal, and give a 95% CI. 

e Estimate the proportion of players in the data set who are pitchers, and give a 95%
CI.

f Do you think that unequal-probability sampling resulted in more efficiency for 
your estimators? Why, or why not? 


IPUMS exercises. Exercise 37 of Chapter 2 described the IPUMS data. 

a Select an unequal-probability sample of 10 psus, with probability proportional to 
number of persons. Take a subsample of 20 persons in each of the selected psus. 

b Using the sample you selected, estimate the population mean and total of inctot 
and give the standard errors of your estimates. 


Complex Surveys 


There is no more effective medicine to apply to feverish public sentiment than figures. To be sure, they 
must be properly prepared, must cover the case, not confine themselves to a quarter of it, and they must 
be gathered for their own sake, not for the sake of a theory. Such preparation we get in a national census. 


—Ida Tarbell, The Ways of Woman (1915)


Most large surveys involve several of the ideas we have discussed: A survey may be 
stratified with several stages of clustering and rely on ratio and regression estimation 
to adjust for other variables. The formulas for estimating standard errors can become 
complicated, especially if there are several stages of clustering without replacement. 
Sampling weights and design effects are commonly used in complex surveys to sim- 
plify matters. These, and plots for complex survey data, are discussed in this chapter. 
The chapter concludes with a description of the National Crime Victimization Survey 
(NCVS) design, and with parallels between survey samples and designed experiments. 


7.1 Assembling Design Components


We have seen most of the components of a complex survey: random sampling, ratio 
estimation, stratification, and clustering. Now, let’s see how to assemble them into one 
sampling design. Although in practice weights (Section 7.2) are often used to find 
point estimates and computer-intensive methods (Chapter 9) are used to calculate 
variances of the estimates, understanding the basic principles of how the components 
work together is important. Here are the concepts you already know, in a modular 
form ready for assembly. 


7.1.1 Building Blocks for Surveys


1 Cluster sampling with replacement. Select a sample of n clusters with replacement;
primary sampling unit (psu) i is selected with probability $\psi_i$ on a draw. Estimate
the total for psu i using an unbiased estimator $\hat{t}_i$. Then treat the n values (the sample
is with replacement, so some of the values in the set may be from the same psus)
of $u_i = \hat{t}_i/\psi_i$ as observations: Estimate the population total by $\bar{u}$, and estimate the
variance of the estimated total by $s_u^2/n$.


2 Cluster sampling without replacement. Select a sample of n psus without replacement;
$\pi_i$ is the probability that psu i is included in the sample. Estimate the total
for psu i using an unbiased estimator $\hat{t}_i$, and calculate an unbiased estimator of the
variance of $\hat{t}_i$, $\hat{V}(\hat{t}_i)$. Then estimate the population total with the Horvitz-Thompson
estimator¹ from (6.19):

$$\hat{t}_{HT} = \sum_{i \in S} \frac{\hat{t}_i}{\pi_i}.$$

Use an exact formula from Chapters 5 or 6 or a method from Chapter 9 to estimate
the variance. We often estimate the variance assuming that psus were selected with
replacement, as discussed in Section 6.4.


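3 Stratification. Let $\hat{t}_1, \ldots, \hat{t}_H$ be unbiased estimators of the stratum totals $t_1, \ldots, t_H$,
and let $\hat{V}(\hat{t}_1), \ldots, \hat{V}(\hat{t}_H)$ be unbiased estimators of the variances. Then estimate
the population total by

$$\hat{t} = \sum_{h=1}^{H} \hat{t}_h$$

and its variance by

$$\hat{V}(\hat{t}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h).$$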
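The three building blocks can be sketched in a few lines of Python. This is only an illustration: the function names and the small numeric inputs are hypothetical, and the Horvitz-Thompson function returns the point estimate only, since its variance requires more design information.

```python
from statistics import mean, variance

def wr_cluster_total(t_hat, psi):
    """Block 1: with-replacement cluster sample.
    t_hat[i] is the estimated total for the psu chosen on draw i,
    psi[i] is that psu's selection probability on a single draw."""
    u = [t / p for t, p in zip(t_hat, psi)]
    n = len(u)
    return mean(u), variance(u) / n  # (u-bar, s_u^2 / n)

def ht_total(t_hat, pi):
    """Block 2: Horvitz-Thompson point estimate of the total."""
    return sum(t / p for t, p in zip(t_hat, pi))

def combine_strata(totals, variances):
    """Block 3: add stratum estimates and their variance estimates."""
    return sum(totals), sum(variances)

# Hypothetical illustration: two strata, each sampled with replacement.
est_a = wr_cluster_total([20.0, 25.0, 38.0], [0.10, 0.10, 0.20])
est_b = wr_cluster_total([12.0, 9.0], [0.25, 0.25])
total, var_total = combine_strata([est_a[0], est_b[0]], [est_a[1], est_b[1]])
print(total, var_total)
```

Each stratum is estimated on its own, and the results are only added at the end; this mirrors the modular way the blocks are assembled in a stratified multistage design.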

Stratification usually forms the coarsest classification: Strata may be, for exam- 
ple, areas of the country, different area codes, or types of habitat. Clusters (sometimes 
several stages of clusters) are sampled from each stratum in the design, and additional 
stratification may occur within clusters. Many surveys have a stratified multistage sur- 
vey design, in which a stratified sample is taken of psus, and subsamples of secondary 
sampling units (ssus) are selected within each selected psu. With several stages of
clustering and stratification, it helps to draw a diagram or construct a table of the 
survey design, as illustrated in the following example. 


Malaria has long been a serious health problem in the Gambia. Malaria morbidity 
can be reduced by using bed nets that are impregnated with insecticide, but this is 
only effective if the bed nets are in widespread use. In 1991, a nationwide survey 
was designed to estimate the prevalence of bed net use in rural areas. The survey is 
described and results reported in D’Alessandro et al. (1994).

The sampling frame consisted of all rural villages of fewer than 3000 people in 
the Gambia. The villages were stratified by three geographic regions (eastern, central, 


¹ Recall that the Horvitz-Thompson estimator encompasses the other without-replacement, unbiased
estimators of the total as special cases, as discussed in Section 6.4.4. 
and western) and by whether the village had a public health clinic (PHC) or not. In
each region five districts were chosen with probability proportional to the district 
population as estimated in the 1983 national census. In each district four villages 
were chosen, again with probability proportional to census population: two PHC 
villages and two non-PHC villages. Finally, six compounds were chosen more or less 
randomly from each village, and a researcher recorded the number of beds and nets, 
along with other information, for each compound. 
In summary, the sample design is the following: 


Stage   Sampling Unit   Stratification

1       District        Region
2       Village         PHC/non-PHC
3       Compound


To calculate estimates or standard errors using formulas from the previous chapters, 
you would start at Stage 3 and work up. The following are steps you would follow to 
estimate the total number of bed nets (without using ratio estimation): 


1 Record the total number of nets for each compound. 


2 Estimate the total number of nets for each village by (number of compounds in the 
village) x (average number of nets per compound). Find the estimated variance 
of the total number of nets, for each village. 


3 Estimate the total number of nets for the PHC villages in each district. Villages 
were sampled from the district with probabilities proportional to population, so 
formulas from Chapter 6 need to be used to estimate the total and the variance of 
the estimated total. Repeat for the non-PHC villages in each district. 


4 Add the estimates from the two strata (PHC and non-PHC) to estimate the number 
of nets in each district; sum the estimated variances from the two strata to estimate 
the variance for the district. 


5 At this point you have the estimated total number of nets and its estimated variance, 
for each district. Now use two-stage cluster sampling formulas to estimate the total 
number of nets for each region. 


6 Finally, add the estimated totals for each region to estimate the total number 
of bed nets in the Gambia. Add the region variances as called for in stratified 
sampling. 


Sounds a little complicated, doesn’t it? And we have not even included ratio 
estimation, which would almost certainly be incorporated here because we know 
approximate population numbers for the numbers of beds at each stage. Fortunately, 
we do not always have to go to this much work in complex surveys. As we shall see 
later in this chapter and in Chapter 9, we can use sampling weights and computer- 
intensive methods to avoid much of this effort. Using a with-replacement variance 
estimator allows us to estimate a variance using only the weighting, stratification, and 
psu information.
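To make the last point concrete, here is a minimal sketch of one common form of the with-replacement variance estimator. It needs only each observation's stratum label, psu label, and weight. Within stratum h with $n_h$ sampled psus, it forms the weighted psu totals $z_{hi}$ and computes $\sum_h \frac{n_h}{n_h - 1}\sum_i (z_{hi} - \bar{z}_h)^2$. The function name and the data are hypothetical and are not taken from the bed net survey.

```python
from collections import defaultdict

def wr_variance(records):
    """records: iterable of (stratum, psu, weight, y), one per observation unit.
    Returns (estimated total, with-replacement variance estimate)."""
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * y  # weighted total contributed by each psu

    total, var = 0.0, 0.0
    for psus in psu_totals.values():
        z = list(psus.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError("need at least two psus per stratum")
        z_bar = sum(z) / n_h
        total += sum(z)
        var += n_h / (n_h - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return total, var

# Hypothetical data: (stratum, psu, weight, number of nets)
sample = [
    ("east", "d1", 10.0, 2), ("east", "d1", 10.0, 4),
    ("east", "d2", 12.0, 1), ("east", "d2", 12.0, 3),
    ("west", "d3", 8.0, 5), ("west", "d4", 8.0, 0),
]
t_hat, v_hat = wr_variance(sample)
print(t_hat, v_hat)  # 148.0 1744.0
```

Nothing about the later stages of sampling enters the calculation directly; their effect is carried by the variability among the weighted psu totals.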
Ratio estimation may be used at almost any level of the survey, although it is usually
used near the top. We discussed ratio estimation with stratified random sampling in
Section 4.5. The principles are the same for any probability sampling design used
within the strata in a stratified multistage sample. Suppose that the population total
$t_x$ is known for an auxiliary variable x, and that $\hat{t}_y$ and $\hat{t}_x$ are unbiased estimators
for $t_y$ and $t_x$, respectively, from the sample. The combined ratio estimator of the
population total for variable y is

$$\hat{t}_{yrc} = \hat{B}\, t_x,$$

where

$$\hat{B} = \frac{\hat{t}_y}{\hat{t}_x}.$$

In Section 9.1 we show that the mean squared error (MSE) of $\hat{t}_{yrc}$ can be estimated by

$$\hat{V}(\hat{t}_{yrc}) = \left(\frac{t_x}{\hat{t}_x}\right)^2 \left[\hat{V}(\hat{t}_y) + \hat{B}^2\, \hat{V}(\hat{t}_x) - 2\hat{B}\, \widehat{\mathrm{Cov}}(\hat{t}_y, \hat{t}_x)\right].$$

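The separate ratio estimator applies ratio estimation within each stratum first,
then combines the strata:

$$\hat{t}_{yrs} = \sum_{h=1}^{H} \hat{t}_{yhr} = \sum_{h=1}^{H} t_{xh}\, \frac{\hat{t}_{yh}}{\hat{t}_{xh}},$$

with

$$\hat{V}(\hat{t}_{yrs}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_{yhr}).$$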

As we saw in Section 5.2.3, we often use ratio estimation for estimating means,
letting the auxiliary variable $x_i$ be 1 if unit i is in the sample and 0 otherwise. Then $\hat{t}_x$
estimates the population size, and the ratio $\hat{B} = \hat{t}_y/\hat{t}_x$ divides the estimated population
total by the estimated population size.

Other ratios are often of interest as well. One quantity of interest in the bed net 
survey was the proportion of beds that have nets. In this case, x refers to beds and y 
refers to nets. Then, $t_x$ is the total number of beds in the population and $t_y$ is the total
number of bed nets in the population. We estimate the proportion of beds that have
nets by $\hat{B} = \hat{t}_y/\hat{t}_x$. Alternatively, the ratio can be estimated separately for each region
if it is desired to compare the bed net coverage for the regions. 
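As a small numerical sketch of the combined ratio estimator and its estimated MSE, the function below takes the estimated totals, the known total $t_x$, and the estimated variances and covariance as given. All input values are hypothetical; in practice the variance and covariance estimates would come from the design (for example, by the methods of Chapter 9).

```python
def combined_ratio(t_y_hat, t_x_hat, t_x, v_y, v_x, cov_yx):
    """Combined ratio estimate of t_y and its estimated MSE.
    v_y, v_x, cov_yx are estimated variances/covariance of t_y_hat and t_x_hat."""
    b_hat = t_y_hat / t_x_hat
    t_yrc = b_hat * t_x
    v_yrc = (t_x / t_x_hat) ** 2 * (v_y + b_hat ** 2 * v_x - 2 * b_hat * cov_yx)
    return b_hat, t_yrc, v_yrc

# Hypothetical inputs: y = nets, x = beds
b_hat, t_yrc, v_yrc = combined_ratio(
    t_y_hat=400.0, t_x_hat=800.0, t_x=1000.0,
    v_y=900.0, v_x=1600.0, cov_yx=600.0,
)
print(b_hat, t_yrc, v_yrc)  # 0.5 500.0 1093.75
```

A positive covariance between the two estimated totals reduces the bracketed term, which is why ratio estimation tends to help when y and x are strongly related.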


7.1.3 Simplicity in Survey Design


All these design components have been shown to increase efficiency in survey after 
survey. Sometimes, though, an inexperienced survey designer is tempted to use a 
complex sampling design simply because it is there or has been used in the past, not 
because it has been demonstrated to be more efficient. Make sure you know from 
pretests or previous research that a complex design really is more efficient and
practical. A simpler design giving the same amount of information per dollar spent is almost
always to be preferred to a more complicated design: It is often easier to administer 
and analyze, and data from the survey are less likely to be analyzed incorrectly by 
subsequent analysts. A complex design should be efficient for estimating all quanti- 
ties of primary interest—an optimal allocation in stratified sampling for estimating 
the total amount U.S. businesses spend on health care benefits may be very inefficient 
for estimating the percentage of businesses that declare bankruptcy in a year. 


7.2 Sampling Weights


7.2.1 Constructing Sampling Weights


In most large sample surveys, weights are used to calculate point estimates. We have 
already seen how sampling weights are used in stratified sampling and in cluster 
sampling. In without-replacement sampling, the sampling weight for an observation 
unit is always the reciprocal of the probability that the observation unit is included in 
the sample. 

Recall that for stratified random sampling,

$$\hat{t}_{str} = \sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}\, y_{hj},$$

where the sampling weight $w_{hj} = N_h/n_h$ can be thought of as the number of observations
in the population represented by the sample observation $y_{hj}$. The probability
of selecting the jth unit in the hth stratum to be in the sample is $\pi_{hj} = n_h/N_h$, so the
sampling weight is simply the inverse of the probability of selection: $w_{hj} = 1/\pi_{hj}$.

The sum of the sampling weights in stratified random sampling equals the population
size N; each sampled unit “represents” a certain number of units in the population,
so the whole sample “represents” the whole population. The stratified sampling estimator
of $\bar{y}_U$ is

$$\bar{y}_{str} = \frac{\sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}\, y_{hj}}{\sum_{h=1}^{H}\sum_{j \in S_h} w_{hj}}.$$


The same forms of the estimators were used in cluster sampling in Section 5.3,
and in the general form of weighted estimators in Section 6.4.4. In cluster sampling
with equal probabilities, for example,

$$w_{ij} = \frac{N M_i}{n\, m_i} = \frac{1}{\text{probability that the } j\text{th ssu in the } i\text{th psu is in the sample}}.$$

Again,

$$\hat{t} = \sum_{i \in S}\sum_{j \in S_i} w_{ij}\, y_{ij},$$

and the estimator of the population mean is

$$\hat{\bar{y}} = \frac{\hat{t}}{\sum_{i \in S}\sum_{j \in S_i} w_{ij}}.$$

For cluster sampling with unequal probabilities, when $\pi_i$ is the probability that the ith
psu is in the sample, and $\pi_{j|i}$ is the probability that the jth ssu is in the sample given
that the ith psu is in the sample, the sampling weights are $w_{ij} = 1/(\pi_i\, \pi_{j|i})$.

For three-stage cluster sampling, the principle extends: Let $w_p$ be the weight for
the psu, $w_{s|p}$ be the weight for the ssu, and $w_{t|s,p}$ be the weight associated with the tsu
(tertiary sampling unit). Then the overall sampling weight for an observation unit is

$$w = w_p \times w_{s|p} \times w_{t|s,p}.$$
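The product rule is easy to check numerically. The sketch below assumes simple random sampling without replacement at every stage, so that each stage weight is a population count divided by a sample count; the function name and the counts are hypothetical.

```python
def three_stage_weight(N, n, M_i, m_i, L_ij, l_ij):
    """Overall weight for a tsu under SRS at each stage (an assumption)."""
    w_p = N / n        # psu weight: N psus in population, n sampled
    w_s = M_i / m_i    # ssu weight given psu i: M_i ssus, m_i sampled
    w_t = L_ij / l_ij  # tsu weight given ssu j of psu i: L_ij tsus, l_ij sampled
    return w_p * w_s * w_t

w = three_stage_weight(N=100, n=10, M_i=20, m_i=5, L_ij=30, l_ij=6)
print(w)  # 10 * 4 * 5 = 200.0
```

With unequal-probability selection at any stage, the corresponding factor would instead be the reciprocal of that stage's conditional inclusion probability.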


All the information needed to construct point estimates is contained in the sam- 
pling weights; when computing point estimates, the sometimes cumbersome proba- 
bilities with which psus, ssus, and tsus are selected appear only through the weights. 
But the sampling weights give no information on how to find standard errors of the 
estimates, and thus knowing the sampling weights alone will not allow you to do 
inferential statistics. Variances of estimates depend on the probabilities that any pair 
of observation units is selected to be in the sample, and require more knowledge of
the sampling design than given by weights alone. 

Very large weights are often truncated or smoothed, so that no single observation 
has a very large contribution to the overall estimate. While this biases the estimators, it 
can reduce the MSE (Elliott and Little, 2000). Truncation is often used when weights 
are used to adjust for nonresponse, as described in Chapter 8. 

Since we consider stratified multistage designs in the remainder of this book, 
from now on we will adopt a unified notation for estimators of population totals. We 
consider $y_i$ to be a measurement on observation unit i, and $w_i$ to be the sampling
weight of observation unit i. Thus, for a stratified random sample, $y_i$ is an observation
unit within a particular stratum, and $w_i = N_h/n_h$, where unit i is in stratum h. This
allows us to write the general estimator of the population total as

$$\hat{t}_y = \sum_{i \in S} w_i\, y_i, \qquad (7.1)$$

where all measurements are at the observation unit level. The general estimator of the
population mean is

$$\hat{\bar{y}} = \frac{\hat{t}_y}{\sum_{i \in S} w_i}; \qquad (7.2)$$

$\sum_{i \in S} w_i$ estimates the number of observation units in the population.
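Estimators (7.1) and (7.2) need nothing beyond a list of weights and a list of measurements. The following sketch uses a hypothetical stratified random sample (stratum 1: $N_1 = 40$, $n_1 = 2$; stratum 2: $N_2 = 30$, $n_2 = 3$); the data values are invented for illustration.

```python
def weighted_total(w, y):
    """Estimator (7.1): sum of w_i * y_i over the sample."""
    return sum(wi * yi for wi, yi in zip(w, y))

def weighted_mean(w, y):
    """Estimator (7.2): estimated total divided by the sum of the weights."""
    return weighted_total(w, y) / sum(w)

# Hypothetical sample: weights are N_h / n_h for each unit's stratum.
w = [20.0, 20.0, 10.0, 10.0, 10.0]
y = [5.0, 7.0, 1.0, 2.0, 3.0]

print(sum(w))                # 70.0, the population size N_1 + N_2
print(weighted_total(w, y))  # 300.0
print(weighted_mean(w, y))   # about 4.29
print(sum(y) / len(y))       # 3.6, the unweighted mean, which ignores the design
```

The gap between the weighted and unweighted means arises because the sample is not self-weighting: units from stratum 1 each stand for twice as many population units as those from stratum 2.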
The Gambia bed net survey in Example 7.1 was designed so that within each region
each compound would have almost the same probability of being included in the
survey; probabilities varied only because different districts had different numbers of
persons in PHC villages and because number of compounds might not always be
exactly proportional to village population. For the central region PHC villages, for
example, the probability that a given compound would be included in the survey was

$$P(\text{district selected}) \times P(\text{village selected} \mid \text{district selected}) \times P(\text{compound selected} \mid \text{district and village selected}) \propto \frac{D_1}{R} \times \frac{V}{D_2} \times \frac{1}{C},$$

where
C = number of compounds in the village
V = number of people in the village
$D_1$ = number of people in the district
$D_2$ = number of people in the district in PHC villages
R = number of people in PHC villages in all central districts


Since the number of compounds in a village will be roughly proportional to the
number of people in a village, V/C should be approximately the same for all compounds.
The value of R is also the same for all compounds within a region. The weights
for each region, the reciprocals of the inclusion probabilities, differ largely because
of the variability in $D_1/D_2$. As R varies from stratum to stratum, though, compounds
in more populous strata have higher weights than those in less populous strata.
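The reasoning in this example can be checked with a short calculation. The sketch below assumes that the inclusion probability is proportional to $(D_1/R) \times (V/D_2) \times (1/C)$, which is the form implied by the discussion of V/C, R, and $D_1/D_2$ above, and computes weights up to a constant factor. The function name and all the counts are hypothetical.

```python
def relative_weight(C, V, D1, D2, R):
    """Reciprocal of (D1/R) * (V/D2) * (1/C), i.e. the weight up to a constant.
    Assumes the proportional form of the inclusion probability stated above."""
    return (R / D1) * (D2 / V) * C

# Two hypothetical compounds in the same region (same R), with the same
# people-per-compound ratio V/C = 20 but different shares of PHC residents.
w_a = relative_weight(C=30, V=600, D1=10000, D2=4000, R=50000)
w_b = relative_weight(C=50, V=1000, D1=10000, D2=8000, R=50000)
print(w_b / w_a)  # about 2: the weights differ only through D1/D2
```

Holding V/C and R fixed, doubling $D_2/D_1$ doubles the weight, which is the sense in which the weights within a region differ largely because of the variability in $D_1/D_2$.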


7.2.2 Self-Weighting and Non-Self-Weighting Samples


Sampling weights for all observation units are equal in self-weighting surveys. Self- 
weighting samples can, in the absence of nonsampling errors, be considered repre- 
sentative of the population because each observed unit represents the same number of 
unobserved units in the population. Standard statistical methods may then be applied 
to the sample to obtain point estimates. A histogram of the sample values displays the 
approximate frequencies of occurrence in the population; the sample mean, median, 
and other sample statistics estimate the corresponding population quantities. In addi- 
tion, self-weighting samples often yield smaller variances, and sample statistics are 
more robust (Kish, 1992). 

Most large self-weighting samples used in practice are not simple random samples 
(SRSs), however. Stratification is used to reduce variances and obtain separate esti- 
mates for domains of interest; clustering, often with unequal probabilities, is used to 
reduce costs. You need to use statistical software that is specifically designed for survey 
data to obtain valid statistics from complex survey data. If you instead use statistical 
software that is intended for data fulfilling the usual statistical assumption that obser- 
vations are independent and identically distributed, the standard errors, hypothesis test 
results, and confidence intervals (CIs) produced by the software will be wrong. If the 
sample is not self-weighting, estimates of means and percentiles produced by standard 
statistical software will also be biased. When you read a paper or book in which the 
authors analyze data from a complex survey, see whether they accounted for the data 
structure in the analysis, or whether they simply ran the raw data through a non-survey
statistical package procedure and reported the results. If the latter, their inferential 
results must be viewed with suspicion; it is possible that they only find statistical 
significance because they fail to account for the survey design in the standard errors. 

Many surveys, of course, purposely sample observation units with different prob- 
abilities. The disproportionate sampling probabilities often occur in the stratification: 
a higher sampling fraction is used for a stratum of large business establishments than 
for a stratum of small business establishments. The United States National Health 
and Nutrition Examination Survey (NHANES) purposely oversamples areas contain- 
ing large black and Mexican-American populations (Ezzati-Rice and Murphy, 1995; 
National Center for Health Statistics, 2005); oversampling these populations allows
comparison of the health of racial and ethnic minorities. 


7.2.3 Weights and a Model-Based Analysis of Survey Data


You might think that a statistician taking a model-based perspective could ignore the
weights altogether. After all, to a model-based survey statistician, the sample design
is irrelevant and the important part of the analysis is finding a model that summarizes
the population structure; as sampling weights are functions of the probabilities of
selection in the design, perhaps they too are irrelevant.

But the model-based and randomization-based approaches are not as far apart as
some of the literature debating the issue would have you believe. Remember, a statistician
designing a survey to be analyzed using weights implicitly visualizes a model
for the data; NHANES is stratified and subpopulations oversampled precisely because
researchers believe there will be a difference among the subpopulations. Such differences
also need to be included in the model. If you ignore the weights in analyzing data
from NHANES, for example, you implicitly assume that whites, blacks, and Mexican
Americans are largely interchangeable in health status. Ignoring the clustering in the
inference assumes that observations in the same cluster are uncorrelated, which is
not generally true. A data analyst who ignores stratification variables and dependence
among observations is not fitting a good model to the data but is simply being lazy. A
good analysis of survey data using models is difficult, and requires extensive validation
of the model. The books edited by Skinner, Holt, and Smith (1989a) and Chambers
and Skinner (2003) contain several chapters on modeling data from complex surveys.

Many researchers have found that sampling weights contain information that can
be used in a model-based analysis. Little (1991) develops a class of models that result
in estimators that behave like estimators obtained using survey weights. Pfeffermann
(1993, 1996) describes a framework for deciding on whether to use sampling weights
in regression models of survey data. Thompson (1997) and Binder and Roberts (2003)
discuss differences between model-based and design-based inference.


7.3 Estimating a Distribution Function


So far, we have concentrated on estimating population means, totals, and ratios. 
Historically, sampling theory was developed primarily to find these basic statistics 
and to answer questions such as “What percentage of adult males are unemployed?”
or “What is the total amount of money spent on health care in the United States?” or 
“What is the ratio of the numbers of exotic to native birds in an area?” 

But statistics other than means or totals may be of interest. You may want to 
estimate the median income in Canada, find the 95th percentile of test scores, or 
construct a histogram to show the distribution of fish lengths. An insurance company 
may set reimbursements for a medical procedure using the 75th percentile of charges 
for the procedure. We can estimate any of these quantities (but not their standard
errors) with sampling weights. The sampling weights allow us to construct
an empirical distribution for the population. 

Suppose the values for the entire population of N units are known. Then any
quantity of interest may be calculated from the probability mass function,

$$f(y) = \frac{\text{number of units whose value is } y}{N},$$

or the cumulative distribution function (cdf)

$$F(y) = \frac{\text{number of units with value} \le y}{N} = \sum_{x \le y} f(x).$$

In probability theory, these are the probability mass function and cdf for the random
variable Y, where Y is the value obtained from a random sample of size one from
the population. Then $f(y) = P\{Y = y\}$ and $F(y) = P\{Y \le y\}$. Of course, $\sum_y f(y) = F(\infty) = 1$.

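Any population quantity can be calculated from the probability mass function or
cdf. The population mean is

$$\bar{y}_U = \sum_{\substack{\text{values of } y \\ \text{in population}}} y\, f(y).$$

The population variance, too, can be written using the probability mass function:

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 = \frac{N}{N-1}\sum_{y} f(y)\left[y - \sum_{x} x f(x)\right]^2.$$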
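These identities can be verified on a toy population. In the sketch below the eight population values are hypothetical; the code builds f and F from the full population and confirms that the variance computed from f agrees with the usual formula with divisor N - 1.

```python
from collections import Counter
from statistics import variance

def pmf(population):
    """f(y) = (number of units whose value is y) / N."""
    N = len(population)
    return {y: c / N for y, c in sorted(Counter(population).items())}

def cdf(f):
    """F(y) = sum of f(x) over x <= y, evaluated at each population value."""
    F, running = {}, 0.0
    for y in sorted(f):
        running += f[y]
        F[y] = running
    return F

population = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical values, N = 8
N = len(population)
f = pmf(population)
F = cdf(f)

mean_U = sum(y * p for y, p in f.items())
S2 = N / (N - 1) * sum(p * (y - mean_U) ** 2 for y, p in f.items())

print(mean_U)                    # 2.75
print(S2, variance(population))  # both about 1.0714
```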
FIGURE 7.1
The function F(y) for the population of heights.

If the cdf F were continuous, we would define the population median to be the
value m satisfying F(m) = 1/2. Because F has jumps at the values of y in the population,
however, it is possible that the function F(y) does not attain the value 1/2. We
define the finite population median to be the value m satisfying F(m) = 1/2 if such
a value exists; otherwise, a population median is any value in the interval $[m_1, m_2]$,
where $m_1$ is the largest value of y in the population with F(y) < 1/2 and $m_2$ is the
smallest value of y with F(y) > 1/2. In general, $\theta_q$ is a 100qth quantile (percentile) if
$F(\theta_q) = q$ if such a value exists; otherwise, $\theta_q \in [a, b]$ where a is the largest population
value of y with F(y) < q and b is the smallest value of y with F(y) > q. If q < 1/N,
$\theta_q$ is the smallest value of y and if q > 1 - 1/N, $\theta_q$ is the largest value of y.
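When a single number is wanted, one common convention is to report the smallest population value y with F(y) ≥ q. That choice returns $\theta_q$ itself when $F(\theta_q) = q$ is attained, and otherwise returns the upper endpoint b of the interval [a, b]. The sketch below implements this convention; it is an illustrative choice rather than the only valid one, and the population values are hypothetical.

```python
def quantile(population, q):
    """Smallest population value y with F(y) >= q, for 0 < q <= 1."""
    ys = sorted(population)
    N = len(ys)
    for k, y in enumerate(ys, start=1):
        # At least k of the N units have value <= y, so F(y) >= k / N.
        if k / N >= q:
            return y
    return ys[-1]

population = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical values, N = 8

print(quantile(population, 0.50))  # 3: F(2) = 0.375 < 1/2 < F(3) = 0.75
print(quantile(population, 0.75))  # 3: F(3) = 0.75 is attained exactly
print(quantile(population, 0.05))  # 1: q < 1/N gives the smallest value
print(quantile(population, 0.95))  # 4: q > 1 - 1/N gives the largest value
```

For q = 1/2 in this toy population, F never equals 1/2, so any value in [2, 3] is a median under the definition; the function reports the endpoint 3.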


Consider an artificial population of 1000 men and 1000 women in file htpop.dat. Each 
person’s height is measured to the nearest centimeter (cm). The frequency table in 
file htcdf.dat gives the probability mass function and cdf for the 2000 persons in the 
population. Figures 7.1 and 7.2 show the graphs of F(y) and f(y). The population 
mean is $\bar{y}_U = \sum_y y f(y) = 168.6$.

Now let’s take an SRS of size 200 from the population (file htsrs.dat). An SRS is 
self-weighting; each person in the sample represents w; = 10 persons in the popula- 
tion. Hence, the histogram of the sample should resemble f(y) from the population; 
Figure 7.3 shows that it does. 

But suppose a stratified sample of 160 women and 40 men (file htstrat.dat) is taken 
instead of a self-weighting sample. In the stratified sample, each woman has weight 
1000/160 = 6.25 and each man has weight 1000/40 = 25. A histogram of the raw 
data will distort the population distribution, as illustrated in Figure 7.4. The sample 
mean and median are too low because men are underrepresented in the sample.
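The direction of the distortion can be seen with a short calculation. The stratum sizes and sample sizes below are those of the example; the two sample mean heights are hypothetical placeholders and are not taken from htstrat.dat.

```python
n_women, n_men = 160, 40
N_women, N_men = 1000, 1000

w_women = N_women / n_women  # 6.25
w_men = N_men / n_men        # 25.0

# Hypothetical stratum sample means in cm (NOT values from htstrat.dat).
ybar_women, ybar_men = 162.0, 176.0

# The unweighted mean treats every sampled person alike.
unweighted = (n_women * ybar_women + n_men * ybar_men) / (n_women + n_men)

# The weighted mean uses estimator (7.2): sum(w * y) / sum(w).
weighted = (w_women * n_women * ybar_women + w_men * n_men * ybar_men) / (
    w_women * n_women + w_men * n_men
)

print(unweighted)  # about 164.8
print(weighted)    # about 169.0
```

Because women make up 80% of the sample but only 50% of the population, the unweighted mean is pulled toward the women's mean; the weights restore each stratum to its population share.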


Sampling weights allow us to construct empirical probability mass functions and cdfs for
the data. Any statistic can then be calculated. Define the empirical probability mass
function (epmf) to be the sum of the weights for all observations taking on the value y,
divided by the sum of all the weights:

$$\hat{f}(y) = \frac{\sum_{i \in S:\, y_i = y} w_i}{\sum_{i \in S} w_i}.$$

The empirical cumulative distribution function (empirical cdf) $\hat{F}(y)$ is the sum of
all weights for observations with values $\le y$, divided by the sum of all weights:

$$\hat{F}(y) = \sum_{x \le y} \hat{f}(x).$$
FIGURE 7.2
The function f(y) for the population of heights.

FIGURE 7.3
A histogram of raw data from an SRS of size 200. The general shape is similar to that of f(y)
for the population because the sample is self-weighting.

FIGURE 7.4
A histogram of raw data (not using weights) from a stratified sample of 160 women and
40 men. Tall persons are underrepresented in the sample, so this histogram distorts the population
distribution.

FIGURE 7.5
The estimate $\hat{f}(y)$ for the stratified sample of 160 women and 40 men.

For a self-weighting sample, $\hat{f}(y)$ reduces to the relative frequency of $y$ in the sample.
For a non-self-weighting sample, $\hat{f}(y)$ and $\hat{F}(y)$ are attempts to reconstruct the
population functions $f$ and $F$ from the sample. The weight $w_i$ is the number of population
units represented by unit $i$, so $\sum_{i \in S:\, y_i = y} w_i$ estimates the total number of
units in the population that have value $y$. If all weights are integers, we can view
$\hat{F}(y)$ as the cdf of a “pseudo-population” constructed by repeating observation $y_i$
$w_i$ times (see Exercise 6). Consider a probability sample of size 3 from a population of
size 10, with sampled values given in the following table.

Sample value $y_i$ | 4  6  7
Weight $w_i$       | 2  3  5

Using the weights, we can construct a pseudo-population with values {4, 4, 6, 6, 6, 7,
7, 7, 7, 7}; each value of $y_i$ is replicated $w_i$ times. This is not the true population,
of course, but it represents an attempt to estimate the population from the sample.
For this sample of size 3, $\hat{F}(4) = 2/10$, $\hat{F}(6) = 5/10$, and $\hat{F}(7) = 1$. In most
surveys, weights are not integers and the population size is too large to permit constructing
a pseudo-population, but it is sometimes helpful to think of $\hat{F}(y)$ as an estimator of the
population cdf F(y).
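
The pseudo-population idea can be checked with a few lines of code. The following is a minimal Python sketch (the book itself uses SAS); the function names `epmf` and `ecdf` are illustrative and not taken from the book's website code. It rebuilds the pseudo-population for the three-observation sample and accumulates the weights into the empirical cdf.

```python
from collections import defaultdict


def epmf(values, weights):
    """Return {y: f_hat(y)}: the weight at each distinct y divided by the total weight."""
    total = sum(weights)
    mass = defaultdict(float)
    for y, w in zip(values, weights):
        mass[y] += w
    return {y: m / total for y, m in sorted(mass.items())}


def ecdf(values, weights):
    """Return {y: F_hat(y)} by accumulating f_hat over the sorted distinct values."""
    running, cdf = 0.0, {}
    for y, p in epmf(values, weights).items():
        running += p
        cdf[y] = running
    return cdf


# The sample of size 3 from the table above.
y = [4, 6, 7]
w = [2, 3, 5]

# Integer weights let us replicate each y_i exactly w_i times.
pseudo_population = [yi for yi, wi in zip(y, w) for _ in range(wi)]
F_hat = ecdf(y, w)

print(pseudo_population)
print(F_hat)
```

With non-integer weights the replication step no longer applies, but `epmf` and `ecdf` work unchanged because they only use sums of weights.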


Each woman in the stratified sample in Example 7.3 has sampling weight 6.25; each
man has sampling weight 25. The empirical probability mass function from the stratified
sample is in Figure 7.5. The weights correct the underrepresentation of taller
people found in the histogram in Figure 7.4. The scarcity of men in the sample, however,
demands a price: The right tail of $\hat{f}(y)$ has a few spikes of size 0.0125 = 25/2000,
each spike coming from one man in the stratified sample, rather than a number of
values tapering off. ■


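The epmf $\hat{f}(y)$ can be used to find estimates of population quantities. First express
the population characteristic in terms of $f(y)$: $\bar{y}_U = \sum_y y f(y)$ or

$$S^2 = \frac{N}{N-1} \sum_y f(y) \left\{ y - \sum_x x f(x) \right\}^2
      = \frac{N}{N-1} \left[ \sum_y y^2 f(y) - \left\{ \sum_y y f(y) \right\}^2 \right].$$

Then, substitute $\hat{f}(y)$ for every appearance of $f(y)$ to obtain an estimate of the
population characteristic. Using this method, then,

$$\hat{\bar{y}} = \sum_y y \hat{f}(y) = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i}$$

and

$$\hat{S}^2 = \frac{N}{N-1} \left[ \sum_y y^2 \hat{f}(y) - \left\{ \sum_y y \hat{f}(y) \right\}^2 \right]. \qquad (7.3)$$
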
Population quantiles are estimated similarly. Recall that $\theta_q$ is a 100$q$th quantile
if $\theta_q \in [a, b]$, where $a$ is the largest population value of $y$ with $F(y) < q$ and
$b$ is the smallest value of $y$ with $F(y) > q$. Since the empirical cdf $\hat{F}$ is a step
function, we usually interpolate to find a unique value for the quantile. Let $y_1$ be the
largest value in the sample for which $\hat{F}(y_1) < q$ and let $y_2$ be the smallest value in
the sample for which $\hat{F}(y_2) > q$. Then

$$\hat{\theta}_q = y_1 + \frac{q - \hat{F}(y_1)}{\hat{F}(y_2) - \hat{F}(y_1)} (y_2 - y_1). \qquad (7.4)$$

Table 7.1 shows the difference in the estimates when weights for the stratified
sample are incorporated through the function $\hat{f}(y)$. The statistics calculated using
weights are much closer to the population quantities. We note that estimators calculated
using this method are not necessarily unbiased, however, or numerically stable.
In particular, the estimator of $S^2$ in (7.3) is sensitive to roundoff error and in practice
a different estimator such as those studied by Courbois and Urquhart (2004) may be
preferable.

TABLE 7.1
Estimates from samples in Example 7.3

                                          Stratified,    Stratified
Quantity          Population    SRS       No Weights     with Weights
Mean              168.6         168.9     164.6          169.0
Median            167.3         168.8     162.8          167.6
25th percentile   159.9         159.7     156.6          160.7
90th percentile   183.2         183.4     177.5          181.5
Variance, S²      124.5         122.6     93.4           116.8

This simple example only involved stratification, but the method is the same for
any survey design. You need to know only the sampling weights to estimate almost
anything through the empirical cdf. If desired, you can smooth the empirical cdf
before estimating quantiles. Nusser et al. (1996) use a semiparametric approach for
estimating daily dietary intakes of various nutrients from the Continuing Survey of
Food Intakes by Individuals, a stratified multistage survey.

Although the weights may be used to find point estimates through the empirical
cdf, calculating standard errors is much more complicated, and requires knowledge
of the sampling design. Typically, in a stratified multistage sample, we calculate
variances assuming that psus were selected with replacement in each stratum. This
simplifies the analysis considerably, since we do not need to know the joint inclusion
probabilities of the psus or any information about the subsampling design to calculate
the with-replacement variance. In most surveys, the with-replacement variance
estimates are larger than the without-replacement variance estimates, but the increase is
small if the first-stage sampling fractions are small. Variances of statistics calculated
from the empirical cdf will be discussed in Chapter 9.
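
As a rough illustration of the plug-in estimates of the mean and of S² described in this section, here is a small Python sketch (not the book's SAS code). The function names are hypothetical. Because the text warns that the difference-of-squares form in (7.3) is sensitive to roundoff, the sketch swaps in the algebraically equivalent centered form; it also assumes, when N is not supplied, that the sum of the weights can stand in for N.

```python
def weighted_mean(values, weights):
    """Estimate the population mean as sum(w_i * y_i) / sum(w_i)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights, N=None):
    """Plug-in estimate of S^2 built from the epmf.

    Uses the centered form sum_y f_hat(y) * (y - mean)^2, which equals the
    difference-of-squares form algebraically but is less prone to roundoff.
    N defaults to the sum of the weights (an assumption for when N is unknown).
    """
    total = sum(weights)
    if N is None:
        N = total
    mean = weighted_mean(values, weights)
    spread = sum(w * (y - mean) ** 2 for y, w in zip(values, weights)) / total
    return N / (N - 1) * spread


# Toy sample (hypothetical): three values with weights 2, 3, and 5.
y = [4, 6, 7]
w = [2, 3, 5]
print(weighted_mean(y, w))       # (8 + 18 + 35) / 10
print(weighted_variance(y, w))   # (10 / 9) * 1.29
```

These functions give point estimates only; as noted above, standard errors require the design information and are not computed here.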


7.4 Plotting Data from a Complex Survey


Simple plots reveal much information about data from a small SRS or representative 
systematic sample. Histograms or smoothed density estimates display the shape of 
the data; scatterplots and scatterplot matrices show relationships between variables; 
other plots discussed in Chambers et al. (1983), Cleveland (1994), and Robbins (2005) 
emphasize other features of the data. In a complex sampling design, however, a single 
plot will not display the richness of the data. As seen in Figure 7.4, plots commonly 
used for SRSs can mislead when applied to raw data from non-self-weighting samples. 
Clustering causes numerous difficulties in plotting data from a complex survey, as 
noted in Example 5.7, because we may want to display the clustering structure as well 
as possible unequal weighting in the graphs; the problems are compounded because 
data sets from surveys are often very large and involve several levels of clustering. 

Data should be plotted both with and without weights to see the effect of the 
weights. In addition, data should be plotted separately for each stratum, and for 
each psu if possible to examine variability in the responses. You already know how 
to plot the raw data without weights; in this section we provide some examples of 
incorporating the weights into the graphics. 


7.4.1 Univariate Plots
7.4.1.1 Histograms 


One of the simplest plots for displaying the shape of data is the histogram. To construct 
a relative frequency histogram for an SRS of size n, divide the range of the data into 
$k$ bins with each bin having width $b$. Then the height of the histogram in the $j$th bin is

$$\text{height}(j) = \frac{\text{relative frequency for bin } j}{b} = \frac{\sum_{i \in S} u_i(j)}{bn},$$

where $u_i(j) = 1$ if observation $i$ is in bin $j$ and 0 otherwise. If a sample is self-weighting,
as with an SRS, a regular histogram of the sample data will estimate the population 
probability mass function. 

We saw in Figure 7.4, though, that if a sample is not self-weighting a histogram 
of the raw data may underrepresent some parts of the population in the display. We 
can use the sampling weights to construct a histogram that estimates the population 
histogram. As before, divide the range of the data into k bins with each bin having 
width $b$. Now use the sampling weights $w_i$ to find the height of the histogram in bin $j$:

$$\text{height}(j) = \frac{\sum_{i \in S} w_i u_i(j)}{b \sum_{i \in S} w_i}. \qquad (7.5)$$

Dividing by the quantity $b \sum_{i \in S} w_i$ ensures that the total area under the histogram
equals 1.


To construct a histogram of the height data from the stratified sample in file htstrat.dat 
(Example 7.3), first decide on a bin width, b. We decide to use b = 3 as in Figure 7.4. 
This choice gives 20 histogram bins. The cutpoints for the histogram bins are at 141, 
144, 147, 150, ..., 198, and 201. The first histogram bar includes persons in the sample 
whose heights are in the interval (141, 144]; the sample contains one woman with 
height 142 and one woman with height 144. Each woman in the sample has sampling 
weight 6.25, so the height of the first histogram bar is

$$\frac{2(6.25)}{b \sum_{i \in S} w_i} = \frac{12.5}{(3)(2000)} = 0.00208.$$


The biggest histogram bar includes persons in the sample with heights in the interval 
(165, 168]; the sample contains 19 women and 6 men with heights in this range, so 
the height for the bar corresponding to (165, 168] is

$$\frac{19(6.25) + 6(25)}{b \sum_{i \in S} w_i} = \frac{268.75}{(3)(2000)} = 0.04479.$$


The heights for the other histogram bars are computed similarly. The histogram for the 
stratified sample, incorporating the weights, is in Figure 7.6. The histogram with 
the weights shows higher relative frequencies for heights over 165 than does the 
histogram without weights in Figure 7.4; Figure 7.6 gives a better picture of the shape 
of the population distribution. SAS code for creating histograms that incorporate 
the sampling weights is provided on the website. ■
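
The bar heights in (7.5) are simple to compute directly. The Python sketch below is an illustration only (the book's own code is SAS); `weighted_histogram` is a hypothetical name, and the data set is made up: two women at 142 and 144 cm as in the example, plus filler observations chosen only so that the weights sum to 2000.

```python
def weighted_histogram(values, weights, cutpoints):
    """Bar heights from (7.5) for the bins (c_j, c_{j+1}]."""
    total = sum(weights)
    heights = []
    for lo, hi in zip(cutpoints, cutpoints[1:]):
        b = hi - lo
        # Sum of w_i * u_i(j): weights of the observations falling in this bin.
        bin_weight = sum(w for y, w in zip(values, weights) if lo < y <= hi)
        heights.append(bin_weight / (b * total))
    return heights


# Made-up sample: 160 women (weight 6.25 each) and 40 men (weight 25 each).
values = [142, 144] + [160] * 158 + [175] * 40
weights = [6.25] * 160 + [25] * 40

cutpoints = list(range(141, 202, 3))      # 141, 144, ..., 201: 20 bins of width 3
heights = weighted_histogram(values, weights, cutpoints)

print(heights[0])                         # first bar: 2(6.25) / ((3)(2000))
print(sum(h * 3 for h in heights))        # total area under the histogram
```

Any plotting library can then draw the bars from `cutpoints` and `heights`; observations outside the cutpoints are silently dropped, so the area equals 1 only when the bins cover all of the data.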

FIGURE 7.6
Histogram of height data from stratified sample, incorporating the sampling weights.

7.4.1.2 Boxplots


Side-by-side boxplots, sometimes called box-and-whisker plots, are a useful way to 
display the distribution of a population or to compare domain distributions visually. 
The box in a boxplot has lines at the 25th, 50th, and 75th percentiles, and whiskers that
extend to the extremes of the data (or, alternatively, to a multiple of the interquartile 
range). If the sample is not self-weighting, the weights should be used to calculate 
the quantiles in the display. 


Consider again the height data in file htstrat.dat. We use the sampling weights to
estimate the quantiles of the data. Using all 200 observations, we note that
$\hat{F}(167) = \sum_{y_i \le 167} w_i / \sum_{i \in S} w_i = 0.4844$ and
$\hat{F}(168) = \sum_{y_i \le 168} w_i / \sum_{i \in S} w_i = 0.5125$. Thus,
any value between 167 and 168 is a median. Several methods can be used to choose
one value to estimate the median. We interpolate between the two bounds according
to the empirical cdf probabilities, and use

$$\hat{\theta}_{0.5} = 167 + \frac{0.5 - 0.4844}{0.5125 - 0.4844} (168 - 167) = 167.6.$$

We find the 25th and 75th percentiles similarly. For the 25th percentile, $\hat{F}(160) = 0.2344$,
$\hat{F}(161) = 0.2563$, and

$$\text{estimated 25th percentile} = 160 + \frac{0.25 - 0.2344}{0.2563 - 0.2344} (161 - 160) = 160.7.$$

For the 75th percentile, $\hat{F}(176) = 0.7344$, $\hat{F}(177) = 0.7594$, and

$$\text{estimated 75th percentile} = 176 + \frac{0.75 - 0.7344}{0.7594 - 0.7344} (177 - 176) = 176.6.$$


SAS PROC SURVEYMEANS will calculate percentiles using the weights; the SAS
code for calculating percentiles is given on the website with output below. We’ll
discuss how these standard errors are calculated in Section 9.5.2; for boxplots, we
just use the point estimates.

Quantiles
Percentile     Estimate      Std Error    95% Confidence Limits
25% Q1         160.714286    0.693338     158.759271   161.493819
50% Median     167.555556    1.011620     165.569707   169.559572
75% Q3         176.625000    1.330767     172.910731   178.159325

FIGURE 7.7
Boxplots of height data from stratified sample, incorporating the sampling weights. The first
box uses data from the entire sample, the second box uses data from the women, and the third
box uses data from the men.


Quantiles for a domain are estimated in similar fashion, using an empirical cdf
that includes only observations in that domain. Define $x_i = 1$ if observation $i$ is in
domain $d$, and 0 otherwise. Then the empirical cdf restricted to domain $d$ is

$$\hat{F}_d(y) = \frac{\sum_{i \in S:\, y_i \le y} x_i w_i}{\sum_{i \in S} x_i w_i}.$$

For the women, $\hat{F}_d(155) = 0.2$ and $\hat{F}_d(156) = 0.275$, so the 25th percentile for women
is estimated by 155 + (0.25 − 0.2)/(0.275 − 0.2) = 155.7. Similarly, the median for
women is 160.7 and the 75th percentile for women is 166.4. The 25th, 50th, and 75th
percentiles for men are 169, 176, and 180, respectively. Side-by-side boxplots of the
data, using these estimated quantiles and extending the whiskers to the range of the
data, are shown in Figure 7.7. Similar boxplots can be constructed from the estimated
quantiles using SAS code on the website. ■
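
The interpolation used for the weighted percentiles, including the restriction to a domain, can be sketched in Python as follows. This is an illustration rather than the book's SAS code: the function names are hypothetical, the toy data are made up, and the handling of ties (the lower point is the largest value whose empirical cdf is strictly below q, the upper point the smallest value whose empirical cdf is at least q) is an assumed convention. The sketch also assumes 0 < q ≤ 1 and a non-empty domain.

```python
def interpolate(y1, F1, y2, F2, q):
    """Linear interpolation between two steps of the empirical cdf."""
    return y1 + (q - F1) / (F2 - F1) * (y2 - y1)


def weighted_quantile(values, weights, q, in_domain=None):
    """Estimate the 100q-th percentile; in_domain is an optional 0/1 indicator x_i."""
    if in_domain is None:
        in_domain = [1] * len(values)
    # Keep domain members only, i.e. observations with x_i * w_i > 0.
    pairs = sorted((y, w) for y, w, x in zip(values, weights, in_domain) if x and w > 0)
    steps, running = [], 0.0
    for y, w in pairs:
        running += w
        if steps and steps[-1][0] == y:
            steps[-1] = (y, running)          # merge tied values into one step
        else:
            steps.append((y, running))
    cdf = [(y, cum / running) for y, cum in steps]   # last entry is exactly 1.0
    below = [(y, F) for y, F in cdf if F < q]
    y2, F2 = next((y, F) for y, F in cdf if F >= q)
    if not below:
        return y2                              # q is at or below the first step
    y1, F1 = below[-1]
    return interpolate(y1, F1, y2, F2, q)


# Arithmetic from the height example, using the cdf values quoted in the text.
print(round(interpolate(167, 0.4844, 168, 0.5125, 0.5), 1))
print(round(interpolate(155, 0.2, 156, 0.275, 0.25), 1))

# Made-up data: the last two observations lie outside the domain.
y = [4, 6, 7, 5, 9]
w = [2, 3, 5, 4, 6]
dom = [1, 1, 1, 0, 0]
print(weighted_quantile(y, w, 0.5))            # all observations
print(weighted_quantile(y, w, 0.5, dom))       # domain only
```

The two `interpolate` calls reproduce the 167.6 and 155.7 obtained by hand in the example. Software packages differ in how they treat ties and the extreme tails, so results from this sketch may not match a given package exactly in those cases.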

7.4.1.3 Density estimates*


Smoothed density estimates are useful for displaying the shape of the estimated 
population data for a variable that takes on a wide range of values. The books by 
Scott (1992), Wand and Jones (1995), and Simonoff (1996) are useful references 
on smoothing with data from an SRS. The idea of smoothing methods is to create 
a smooth version of a histogram. Instead of having bars in a histogram, one could 
create a plot by connecting the heights at the midpoints of the histogram bins. Such a 
plot would not be particularly smooth, however, and could be improved by using each 
possible value of y as the midpoint of a histogram bin of width b, finding the height 
for that bin, and then drawing a line through those values. In essence, the histogram 
bars slide continuously along the horizontal axis; as points enter and leave the bar, 
the height corresponding to the midpoint changes. A symmetric density function K, 
called a kernel function, is used to allow more flexibility in the smoothing method. 
Bellhouse and Stafford (1999) and Buskirk and Lohr (2005) adapted kernel density 
estimation to survey data by incorporating the weights, with

$$\hat{f}(y; b) = \frac{1}{b \sum_{i \in S} w_i} \sum_{i \in S} w_i K\left( \frac{y - y_i}{b} \right).$$

Commonly used kernel functions include the normal kernel function
$K_N(t) = \exp(-t^2/2)/\sqrt{2\pi}$ and the quadratic kernel function
$K_Q(t) = \frac{3}{4}(1 - t^2)$ for $|t| < 1$.
The sliding histogram described above corresponds to a box kernel with $K_B(t) = 1$ for
$|t| \le 1/2$ and $K_B(t) = 0$ for $|t| > 1/2$; in that case, $\hat{f}(y; b)$ corresponds to the
histogram height given in (7.5) for a point $y$ in the middle of a bin of width $b$.

Figure 7.8 shows a smoothed density estimate for the height data from Exam- 
ples 7.3 and 7.5. The website gives the code for constructing this plot. As with the 
histogram in Figure 7.6, using the sampling weights increases the estimated density 
in the right tail despite the paucity of data in that region. 

The choice of b, called the bandwidth, determines the amount of smoothing to 
be used. Small values of b use little smoothing since the sliding window is small. A 
large value of b provides much smoothing since each point in the plot represents the 
weighted average of many points from the data. One problem with density estima- 
tion in survey data is that respondents may round their answers. For example, some 
respondents may round their height to 165 or 170 cm, causing spikes at those values. 
You may want to choose b large to increase the amount of smoothing, or you may 
want to adopt a model for the effect of rounding by the respondent. 
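
A weighted kernel density estimate of this kind takes only a few lines. The Python sketch below is an illustration under stated assumptions, not the code from the book's website: the function names and the toy data are hypothetical, and the estimate is the weighted sum of kernels divided by b times the sum of the weights.

```python
import math


def normal_kernel(t):
    """Standard normal density."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def box_kernel(t):
    """Box kernel: 1 on [-1/2, 1/2], 0 elsewhere (the sliding histogram)."""
    return 1.0 if abs(t) <= 0.5 else 0.0


def weighted_kde(y, values, weights, b, kernel=normal_kernel):
    """Weighted density estimate at the point y with bandwidth b."""
    total = sum(weights)
    return sum(w * kernel((y - yi) / b) for yi, w in zip(values, weights)) / (b * total)


# Made-up data: three observations with weights 2, 3, and 5.
values = [4, 6, 7]
weights = [2, 3, 5]

# With the box kernel and b = 3, the estimate at y = 6 is the weighted
# histogram height for a bin of width 3 centered at 6.
print(weighted_kde(6, values, weights, b=3, kernel=box_kernel))

# A larger bandwidth gives a flatter, smoother curve.
for b in (0.5, 2.0):
    print(b, [round(weighted_kde(g, values, weights, b), 3) for g in (3, 5, 7, 9)])
```

Evaluating `weighted_kde` over a fine grid of y values and joining the points gives the smoothed curve; the bandwidth `b` is left to the user here, with no automatic selection.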


7.4.1.4 Displaying stratification and clustering information 


The histograms, boxplots, and density estimates for survey data use the sampling 
weights to approximate the corresponding plots that would be constructed if we knew 
the data values for the entire population. 


The 1987 Survey of Youth in Custody (Beck et al., 1988; U.S. Department of Justice,
1989) sampled juveniles and young adults in long-term, state-operated juvenile
institutions. Residents of facilities at the end of 1987 were interviewed about family
background, previous criminal history, and drug and alcohol use. Selected variables
from the survey are in the file syc.dat.

FIGURE 7.8
Estimated density function for the stratified sample of heights. The circles represent the data
points in the sample.

The facilities form a natural cluster unit for an in-person survey; the sampling 
frame of 206 facilities was constructed from the 1985 Children in Custody (CIC) 
Census. The psus (facilities) were divided into 16 strata by number of residents in the 
1985 CIC. Each of the 11 facilities with 360 or more youth formed its own stratum 
(strata 6-16); each of these facilities was included in the sample and residents of the 
11 facilities were subsampled. In strata 1-5, facilities were sampled with probability
proportional to size from the 195 remaining facilities; residents were subsampled with 
predetermined sampling fractions. Table 7.2 contains information about the strata. 

The stratum boundaries were chosen so that the number of residents in each stra- 
tum would be comparable. It was originally intended that each resident have proba- 
bility 1/8 of inclusion in the sample, which would result in a self-weighting sample 
with constant weight 8. The facilities in strata 14 and 16, however, had experienced a 
great deal of growth between 1985 and 1987, so the sampling fractions in those strata 
were changed to 1/11 and 1/12, respectively. In strata 1-5, weights varied from about 
5 to about 15, depending on the facility’s inclusion probability and the predetermined 
sampling fraction in that facility. The weights were further adjusted for nonresponse, 
and to match the sample counts with the 1987 census count of youths in long-term, 
state-operated facilities. After all weighting adjustments were made, weights ranged 
from 5 (in stratum 4) to 58 (for some youths in states that required parental permission 
and hence had lower response rates). 

TABLE 7.2
Survey of Youth in Custody Stratum Information

           CIC Size       Number of    Number of Eligible    Number of
           (Number of     psus in      Residents             psus in
Stratum    Residents)     Frame        in CIC                Sample
1          1-59           99           2881                  11
2          60-119         39           3525                  7
3          120-179        30           4355                  7
4          180-239        13           2594                  7
5          240-359        14           4129                  7


To estimate population quantities with standard errors from a stratified multistage 
sample such as the Survey of Youth in Custody, you need to know the weights, 
the stratification variable, and the variable describing the first-stage cluster units. In 
syc.dat, the weights are in variable finalwt, the strata are in variable stratum and the 
facilities (psus) are in variable facility. There is only one facility in each of strata 
6-16, so that a stratified random sample of individuals is taken in each of those strata; 
we define the psus for those strata to be individuals rather than the facility so that 
they contribute to the standard errors of the estimates. SAS code for calculating the 
average and percentiles of age is on the website, with output given below. We shall 
discuss how to compute standard errors of quantiles in Section 9.5.2. 


                            Std Error
Variable    Mean            of Mean       95% CL for Mean
age         16.639293       0.128882      16.386326   16.892260

Quantiles
Percentile     Estimate      Std Error    95% Confidence Limits
0% Min         11.000000     .            .           .
25% Q1         14.805746     0.225394     14.363348   15.248145
50% Median     15.917433     0.175991     15.572001   16.262864
75% Q3         17.205184     0.154592     16.901754   17.508613
100% Max       24.000000


Let’s look at some plots of the age of residents. Some youths are over age 18
because California Youth Authority facilities were included in the sample. As the
survey aimed to be approximately self-weighting, the histogram of the unweighted
data in Figure 7.9 and the empirical probability mass function incorporating the
weights (variable finalwt) in Figure 7.10 are overall similar in shape. Some
discrepancies appear on closer examination, though: the weights indicate that youths
aged 15 were somewhat undersampled due to unequal selection probabilities and
nonresponse, while youths aged 17 were somewhat oversampled.

FIGURE 7.9
Histogram of all data, not incorporating weights. The histogram shows the distribution of ages
in the sample, but does not necessarily reflect the distribution of ages in the population.

FIGURE 7.10
Estimated probability mass function for age, $\hat{f}(y)$. The shape is similar to that of the
histogram of the raw data, but there are relatively more 15-year-olds and relatively fewer
17-year-olds.

If we were only interested in the distribution for the entire population, we could 
concentrate on plots such as those in Figures 7.9 and 7.10, and similar plots informa- 
tive about univariate distributions such as probability-probability or quantile-quantile 
plots (see Exercises 19 and 20). But we would also like to explore stratum-to-stratum 
differences in age distribution. Figure 7.11 incorporates weights into boxplots of the 
data. 

As the response variable age is discrete, we can show even more detail for each
stratum. Figure 7.12 displays the sum of the weights for each age within each stratum.
The estimated relative frequency of youths with that age in each stratum is indicated
by a circle whose area is proportional to the sum of the weights.

We may also be interested in the facility-to-facility variability. Figures 7.13 and
7.14 show similar plots for the psus in stratum 5. These plots could be drawn for each
stratum to show differences in psu variability among the strata. ■

FIGURE 7.11
Boxplot of age distributions for each stratum, incorporating the weights. Note the wide
variability from stratum to stratum.

FIGURE 7.12
Age distribution for each stratum. The area of each circle is proportional to the sum of the
weights for sample observations in that stratum and age class. The highest number of youths
under age 18 are in strata 1 through 5.

FIGURE 7.13
Boxplots of ages, incorporating weights, for the psus in stratum 5. The width of each boxplot
is proportional to the number of sample observations in that facility.

FIGURE 7.14
Age distribution for each psu in Stratum 5. The area of each circle is proportional to the sum
of the weights for sample observations in that psu and age class.

7.4.2 Bivariate Plots


You may also be interested in bivariate relationships among variables. We typically 
explore such relationships visually using a scatterplot. With complex survey data, 
unequal weights should be considered for interpreting bivariate relationships. 

Since they involve two variables, scatterplots are more complicated than univariate 
displays. Many government surveys have large amounts of data. The U.S. Current 
Population Survey (CPS), for example, collects data from 60,000 households each 
year (U.S. Census Bureau, 2006a). A scatterplot of two continuous variables from 
the CPS will have so many data points that the graph may be solid black and useless 
for visual inspection of the data. In addition, if both variables take on integer values, 
for example age and years in workforce, many points will share the same x and y 
values. 

The challenging part for scatterplots is how to incorporate the weights. In a his- 
togram, only the horizontal axis uses the data values so the weights can be incorporated 
in the relative frequencies displayed on the vertical axis. But in a scatterplot, the hor- 
izontal axis displays information about the x variable and the vertical axis displays 
information about the y variable, so the weight information must be incorporated 
by some other means. Korn and Graubard (1998; 1999, Section 3.4) suggest several 
methods for constructing scatterplots from complex survey data. We illustrate some of 
these plots, and others, using the 2003-2004 NHANES data, plotting the body mass 
index vs. age for a stratified multistage sample of approximately 10,000 persons. It 
is generally a good idea to construct a variety of plots since some plots will work 
better with a data set than others. Body mass index is calculated as weight/height²,
in units kg/m². Age is topcoded at 85 to protect confidentiality of the respondents;
any person with age greater than 85 is assigned age value 85. SAS code used to con- 
struct these plots from data in file nhanes.dat is given on the website. Figure 7.15 
shows a plot of the raw data without weights; as you can see, the data set is so 
large that it is difficult to see the structure of the bivariate relationship from the 
graph. 


7.4.2.1 Plot with circles proportional in size to observation weights 


The plot in Figure 7.15 does not include information about the weights. The NHANES 
survey is designed so that it oversamples areas with large minority populations. The 
sample weights of individuals in those areas, therefore, are smaller. To get a better 
view of the data, we should incorporate the unequal weights. One way of doing that 
is to use a circle as a plotting symbol, and, for each observation, make the area of the 
circle proportional to the weight of the observation. This plot for the NHANES data 
is shown in Figure 7.16. The data are easier to see on this plot than in Figure 7.15 
because the plotting symbols are smaller; however, there are still so many data points 
that some features may be obscured. Observations with small weights have very small 
circles and are nearly invisible. In larger data sets, such as the CPS, a weighted plot 
will still have such high data density in areas that the plot will appear to be a solid 
mass. 

FIGURE 7.15
Plot of raw data from NHANES. There are so many data points that it is difficult to see patterns
in the plot; in addition, no weighting information is used. This plot is not recommended for
complex survey data with unequal weights.

FIGURE 7.16
Weighted circle plot of NHANES data. The circle size for each point is proportional to the
weight for that point.

7.4.2.2 Plot a subsample of points


Instead of plotting all the data, we can plot a subset of the data. Since the
sampling weight of an observation can be interpreted as the number of population units
represented by that unit, a plot of a subsample selected with probabilities proportional
to the weights can be interpreted much the same way as a regular scatterplot (see
Exercise 23). Figure 7.17 shows a scatterplot of a sample of 500 points selected
with replacement from the NHANES data with probability proportional to the weight
variable. This plot can be repeated with different subsamples, and each plot will be
different. Each plot, however, has less information than the full data set since it is
based on a subsample of the data. Unusual observations such as outliers might not
appear on a single plot.

FIGURE 7.17
Plot of subsample of NHANES data, selected with probability proportional to the weight
variable.
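
Drawing such a subsample is straightforward with the Python standard library. The sketch below is illustrative only (the book's own code is SAS); the function name `pps_subsample` and the toy records are made up. It selects k records with replacement, with probability proportional to the weights, using `random.choices`.

```python
import random


def pps_subsample(records, weights, k, seed=None):
    """Draw k records with replacement, with probability proportional to the weights."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=k)


# Made-up (age, body mass index) records with made-up sampling weights.
records = [(23, 21.5), (41, 27.3), (67, 30.1), (35, 24.8)]
weights = [1200.0, 300.0, 4500.0, 800.0]

subsample = pps_subsample(records, weights, k=10, seed=1)
print(subsample)
```

The selected points are then plotted as an ordinary unweighted scatterplot. Changing `seed` gives a different subsample, which is a convenient way to see how much the picture varies from draw to draw.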


7.4.2.3 Use circles to represent weights 


This idea is similar to creating bins for a histogram. In a histogram, the y values are 
grouped into a bin, and the sum of the weights is found for the y values falling into 
each bin. To extend this idea to a scatterplot, divide the region into rectangles. Find the 
sum of the weights for the (x, y) values falling in each rectangle. Then, plot a circle 
with area proportional to the sum of the weights at the midpoint of the rectangle. 
Figure 7.18 shows the NHANES data in bins formed by rounding the x and y values 
to the closest multiple of 5. This type of plot is especially useful if the data set contains 
many points at the same values of (x, y), since the plot displays the multiplicity of 
points with the same values. 
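
The binning step behind such a circle plot can be sketched as follows. This Python fragment is an illustration, not the book's SAS code: the function names and the three toy observations are hypothetical, and halves are rounded up when finding the closest multiple of 5 (an assumed convention). Since circle area is meant to be proportional to the sum of the weights, the radius is taken proportional to the square root of that sum.

```python
import math
from collections import defaultdict


def round_to_multiple(value, width):
    """Nearest multiple of width, with halves rounded up."""
    return width * math.floor(value / width + 0.5)


def binned_weight_sums(xs, ys, weights, width=5):
    """Sum the weights of the (x, y) points falling in each rectangle."""
    sums = defaultdict(float)
    for x, y, w in zip(xs, ys, weights):
        cell = (round_to_multiple(x, width), round_to_multiple(y, width))
        sums[cell] += w
    return dict(sums)


def circle_radii(sums, max_radius=1.0):
    """Radii that make circle area proportional to each cell's weight sum."""
    largest = max(sums.values())
    return {cell: max_radius * math.sqrt(s / largest) for cell, s in sums.items()}


# Made-up ages, body mass indexes, and weights.
age = [21, 22, 38]
bmi = [24.2, 26.1, 31.0]
wts = [10, 20, 5]

sums = binned_weight_sums(age, bmi, wts)
print(sums)
print(circle_radii(sums))
```

The same cell sums could drive the shading of a grid instead of circle sizes; only the final mapping from sum to plotting attribute changes.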


7.4.2.4 Use shading to represent weights 


Instead of using the size of the circle to represent the sum of the weights in a bin, as 
in Figure 7.18, you can use shading to indicate the sum of weights. This often allows 
you to use more levels for the x and y values than in a plot with circles. For the plot 
in Figure 7.19, we form bins by rounding the x and y values to the nearest integer, 
creating a grid of x and y values. For each bin, calculate z= sum of the weights for 
the points with x and y values in that bin. The shading in Figure 7.19 is proportional 
to the average of the z values for the four corners of each rectangle. 

FIGURE 7.18
Circle plot of NHANES data. The area of each circle is proportional to the sum of the weights
of the set of observations near the center of the circle.

FIGURE 7.19
Shaded plot of NHANES data. The shading relies on the sum of the weights for each rectangle.

7.4.2.5 Side-by-side boxplots


Instead of drawing circles at a set of gridpoints or using shading, we can
group the x variable into bins and draw a boxplot at the midpoint of each x bin.
We then use the weights to calculate the quantiles of each bin as in Example 7.6.
Figure 7.20 shows side-by-side boxplots of the NHANES data, where the age (x)
variable is grouped in bins by rounding values to the closest multiple of 5. The width
of each boxplot is proportional to the sum of the weights for observations in that bin.

FIGURE 7.20
Side-by-side boxplots of NHANES data. The width of each box is proportional to the sum of
the weights of the set of observations used for the box. The + in each box denotes the mean.


7.4.2.6 Smoothed function estimates 


Korn and Graubard (1998) and Bellhouse and Stafford (2001) propose using smoothed 
function estimates to display trends in the data. Kernel smoothing with weights, 
as in the smoothed density estimates in Figure 7.8, is used to obtain a trend line. 
Wand and Jones (1995) and Simonoff (1996) present methods for estimating trend 
lines for data assumed to be independent and identically distributed; Exercise 25 of 
Chapter 11 adapts these methods for data with unequal weights. The simplest method 
for estimating a trend line takes a weighted average of the $y_i$ values that fall in the
data window for $x$; at each point $x$, the function $g(x)$ is estimated by

$$\hat{g}(x) = \frac{\sum_{i \in S} w_i K\left( \frac{x - x_i}{b} \right) y_i}{\sum_{i \in S} w_i K\left( \frac{x - x_i}{b} \right)}.$$

Other methods fit a straight line or polynomial in each window.
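
The simplest trend estimate, a weighted average of the y values in a window around x, can be sketched in Python as below. This is an assumed form for illustration (sampling weight times kernel weight in both numerator and denominator), not the book's SAS code; the function names and toy data are hypothetical, and a box kernel is used so the window is easy to check by hand.

```python
def box_kernel(t):
    """1 inside the window [-1/2, 1/2], 0 outside."""
    return 1.0 if abs(t) <= 0.5 else 0.0


def weighted_trend(x, xs, ys, weights, b, kernel=box_kernel):
    """Weighted average of the y_i whose x_i fall in the window of width b around x.

    Returns None when no observation has positive kernel weight at x.
    """
    num = den = 0.0
    for xi, yi, wi in zip(xs, ys, weights):
        k = wi * kernel((x - xi) / b)
        num += k * yi
        den += k
    return num / den if den > 0 else None


# Made-up data.
xs = [1, 2, 3, 10]
ys = [10, 20, 30, 100]
wts = [1, 1, 2, 5]

print(weighted_trend(2, xs, ys, wts, b=2))   # window covers x = 1, 2, 3
print(weighted_trend(6, xs, ys, wts, b=2))   # empty window
```

Evaluating `weighted_trend` over a grid of x values traces out the trend line. A smooth kernel gives a smoother curve than the box kernel; local linear fits, as used for the book's figure, need a weighted regression in each window and are not shown here.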

The trend line can be displayed by itself, or (recommended) as an overlay on 
one of the other plots. Figure 7.21 displays a trend line, computed using local linear 
regression, with the weighted circle plot from Figure 7.16; we changed the color of 
the data points from black to gray so that the trend line is more visible. Note that the 
trend line approximately follows the line of means in the boxplots from Figure 7.20. 

FIGURE 7.21
Weighted circle plot of NHANES data, with trend line.

7.5 Design Effects


Cornfield (1951) suggested measuring the efficiency of a sampling plan by the ratio 
of the variance that would be obtained from an SRS of k observation units to the 
variance obtained from the complex sampling plan with k observation units. Kish 
(1965) named the reciprocal of Cornfield’s ratio the design effect (abbreviated deff) 
of a sampling plan and estimator and used it to summarize the effect of the design on 
the variance of the estimator: 


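$$\text{deff(plan, statistic)} = \frac{V(\text{estimator from sampling plan})}{V(\text{estimator from an SRS with same number of observation units})}. \qquad (7.6)$$

For estimating a mean from a sample with n observation units, 

$$\text{deff(plan, } \bar{y}) = \frac{V(\bar{y})}{\left(1 - \dfrac{n}{N}\right) \dfrac{S^2}{n}}.$$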


The design effect provides a measure of the precision gained or lost by use of the 
more complicated design instead of an SRS. Although it is a useful concept, it is not 
a way to avoid calculating variances: You need an estimate of the variance from the 
complex design to find the design effect. Of course, different quantities in the same 
survey may have different design effects. Kish showed how the design effect allows 
you to use prior knowledge for the survey design. 

The SRS variance is generally easier to obtain than $V(\bar{y})$. If estimating a proportion, 
the SRS variance is approximately $p(1 - p)/n$; if estimating another type of mean, the 
SRS variance is approximately $S^2/n$. So if the design effect is approximately known, 
the variance of the estimator from the complex sample can be estimated by (deff × 
SRS variance). We can estimate the variance of an estimated proportion $\hat{p}$ by 

$$\hat{V}(\hat{p}) = \text{deff} \times \frac{\hat{p}(1 - \hat{p})}{n}.$$

We have seen design effects for several sampling plans. In Section 3.4 the design 
effect for stratified sampling with proportional allocation was shown to be approximately 

$$\frac{V_{\text{prop}}}{V_{\text{SRS}}} \approx \frac{\sum_{h=1}^{H} N_h S_h^2}{\sum_{h=1}^{H} N_h \left[ S_h^2 + (\bar{y}_{hU} - \bar{y}_U)^2 \right]}.$$

Unless all of the stratum means are equal, the design effect for a stratified sample will 
usually be less than 1; stratification generally gives more precision per observation 
unit than an SRS. 

We also looked extensively at design effects in cluster sampling, particularly in 
Section 5.2.2. From (5.10), the design effect for single-stage cluster sampling when 
all psus have M ssus is approximately 


$$1 + (M - 1)\,\text{ICC}.$$


The intraclass correlation coefficient (ICC) is usually positive in cluster sampling, so 
the design effect is usually larger than 1; cluster samples usually give less precision 
per observation unit than an SRS. 
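
To make the clustering formula concrete, here is a small Python sketch that evaluates $1 + (M - 1)\,\text{ICC}$ and converts the result into the size of an SRS with roughly the same variance. The function names and the numbers (20 ssus per psu, ICC of 0.05, 1000 observation units) are hypothetical, and the "equivalent SRS size" n/deff is simply a restatement of the variance ratio in (7.6), assuming the SRS variance is proportional to 1/n.

```python
def cluster_design_effect(m, icc):
    """Approximate deff for one-stage cluster sampling with m ssus per psu."""
    return 1 + (m - 1) * icc


def equivalent_srs_size(n, deff):
    """Size of an SRS that would give about the same variance as n units
    observed under a design with the given design effect."""
    return n / deff


# Hypothetical values, chosen only for illustration.
deff = cluster_design_effect(m=20, icc=0.05)
n_equiv = equivalent_srs_size(n=1000, deff=deff)
```

With these made-up inputs the design effect is close to 2, so the 1000 clustered observations carry about as much information as an SRS of roughly half that size.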

In surveys with both stratification and clustering, we cannot say before calculating 
variances for our sample whether the design effect for a given quantity will be less 
than 1 or greater than 1. Stratification tends to increase precision and clustering tends 
to decrease it, so the overall design effect depends on whether more precision is lost 
by clustering than gained by stratification. 


EXAMPLE 7.8 
For the bed net survey discussed in Example 7.1, the design effect for the proportion 
of beds with nets was calculated to be 5.89. This means that about six times as many 
observations are needed with the complex sampling design used in the survey to obtain 
the same precision that would have been achieved with an SRS. The high design effect 
in this survey is due to the clustering: Villages tend to be homogeneous in bed net 
use. If you ignored the clustering and analyzed the sample as though it were an SRS, 
the estimated standard errors would be much too low, and you would think you had 
much more precision than really existed. ■ 


7.5.1 Design Effects and Confidence Intervals 


If the design effect for each statistic is known, one can use it in conjunction with 
standard software to obtain CIs for means and totals. If n observation units are sampled 
from a population of N possible observation units and if $\hat{p}$ is the survey estimate of the 
proportion of interest, an approximate 95% CI for $p$ is (assuming the finite population 
correction is close to 1): 

$$\hat{p} \pm 1.96 \sqrt{\text{deff}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}. \qquad (7.8)$$

When estimating a mean rather than a proportion, if the sample is large enough to 
apply a central limit theorem, an approximate 95% CI is 

$$\bar{y} \pm 1.96 \sqrt{\text{deff}} \sqrt{\frac{\hat{S}^2}{n}},$$

where $\hat{S}^2$ may be calculated using (7.3). 
Kish (1995) and other authors sometimes use design effect to refer to the quantity 

$$\text{deft}(\bar{y}) = \frac{\text{SE}(\text{plan})}{s/\sqrt{n}},$$

so that deft (the name deft is due to Tukey, 1968) will be an appropriate multiplier 
for a standard error or CI half-width. In practice, as Kish points out, choice of deff or 
deft makes little difference, but you need to pay attention to which definition a survey 
uses. 
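
The interval in (7.8) and its analogue for a mean are easy to compute once a design effect is in hand. The Python sketch below is one way to do it; the function names and the inputs (an estimated proportion of 0.5 from 400 observation units with a design effect of 4) are assumptions chosen for illustration, and the finite population correction is ignored as in the text.

```python
import math


def ci_proportion(p_hat, n, deff, z=1.96):
    """Approximate CI for a proportion, inflating the SRS standard error
    by sqrt(deff)."""
    half_width = z * math.sqrt(deff) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def ci_mean(y_bar, s2_hat, n, deff, z=1.96):
    """Approximate CI for a mean, where s2_hat estimates the population
    variance S^2."""
    half_width = z * math.sqrt(deff) * math.sqrt(s2_hat / n)
    return y_bar - half_width, y_bar + half_width


# Hypothetical survey result.
lower, upper = ci_proportion(p_hat=0.5, n=400, deff=4.0)
```

Here the multiplier `math.sqrt(deff)` plays the role of deft: a design effect of 4 doubles the half-width that an SRS analysis would report.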


7.5.2 Design Effects and Sample Sizes 


Design effects are extremely useful for estimating the sample size needed for a survey. 
That is the purpose for which the design effect was introduced by Cornfield (1951), who used it 
to estimate the sample size that would be needed if the sampling unit in a survey 
estimating the prevalence of tuberculosis was a census tract or block rather than 
an individual. The maximum allowable error was specified to be 20% of the true 
prevalence, or 0.2 × p. If the prevalence of tuberculosis was 1%, the sample size for 
an SRS would need to be 

$$n = \frac{1.96^2\, p(1 - p)}{(0.2p)^2} = 9508.$$

Cornfield recommended increasing the sample size for an SRS to 20,000, to give more 
precision in separate estimates for subpopulations. He estimated the design effect for 
sampling census tracts rather than individuals to be 7.4 and concluded that if census 
tracts, which averaged 4600 individuals, were used as a sampling unit, a sample size 
of 148,000 adults, rather than 20,000 adults, would be needed. 

If you know the design effect for a similar survey, you only need to be able to 
estimate the sample size you would take using an SRS. Then multiply that sample 
size by deff to obtain the number of observation units you need to observe with the 
complex design. For sample size purposes, you may wish to use separate design 
effects for each stratum. 
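
The arithmetic in Cornfield's example can be reproduced with a few lines of Python. The function names below are hypothetical; the inputs (prevalence 1%, allowable error 20% of the prevalence, SRS size raised to 20,000, design effect 7.4) are the values quoted above.

```python
def srs_size_for_proportion(p, relative_error, z=1.96):
    """SRS sample size so that the CI half-width is relative_error * p,
    ignoring the finite population correction."""
    allowable_error = relative_error * p
    return z ** 2 * p * (1 - p) / allowable_error ** 2


def complex_design_size(n_srs, deff):
    """Number of observation units needed under the complex design."""
    return n_srs * deff


n_srs = srs_size_for_proportion(p=0.01, relative_error=0.2)
n_tracts_design = complex_design_size(n_srs=20_000, deff=7.4)
```

Rounding `n_srs` to the nearest integer gives the 9508 in the text, and multiplying the enlarged SRS size of 20,000 by 7.4 gives the 148,000 adults Cornfield reported.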


7.6 The National Crime Victimization Survey 


Most crime statistics given in U.S. newspapers come from the Uniform Crime Reports, 
compiled by the FBI from reports submitted by law-enforcement agencies. But the 
Uniform Crime Reports underestimate the amount of crime in the U.S., largely 
because not all crimes are reported to the police. 

The National Crime Victimization Survey (NCVS) is a large national survey 
administered by the U.S. Bureau of Justice Statistics with interviews conducted by 
the U.S. Census Bureau. Like the CPS, the NCVS follows a stratified, multistage 
cluster design. Information on the design of the CPS is found in U.S. Census Bureau 
(2002a); U.S. Department of Justice (2002, 2006) describe the NCVS design. The 
NCVS surveys households from across the United States and asks household mem- 
bers 12 years old and older about their experiences as victims of crime in the past six 
months. 

We describe the design used for the 2000 NCVS.² The NCVS design is similar 
to that of many other large government surveys: Most have similar methods of strat- 
ification, clustering, and ratio estimation. We shall return to the NCVS in Chapter 8, 
to show how weights are adjusted for nonresponse and undercoverage in this large 
complex survey. 

A psu in the NCVS is a county, a group of adjacent counties, or a large metropolitan 
area. Any psu with population about 550,000 or more (according to the 1990 census) 
is automatically included in the sample. Such a psu is said to be self-representing 
(SR) because it does not represent any psus other than itself. The probability this psu 
will be selected is one. 

All other psus are grouped into strata so that each stratum group has a population 
of about 650,000. In the NCVS, psus are grouped into strata based on geographic 
location, demographic information available from the 1990 census, and on Uniform 
Crime Reports crime rates. One psu is selected from each of these strata with proba- 
bility proportional to population size; this psu is called non-self-representing (NSR) 
because it is supposed to represent not just itself but all the psus in that stratum. 
Within a stratum, a psu with 100,000 population is twice as likely to be selected for 
the sample as a psu with population 50,000. The 2000 NCVS design had 93 SR psus 
and 110 NSR psus. Because victimization rates vary regionally, the large number of 
strata in the NCVS increases the precision of the estimates. 
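
The selection of one NSR psu per stratum with probability proportional to population size can be sketched as follows. This is only an illustration: the stratum contents, psu names, and function names are made up, and the sketch does not attempt to reproduce the Census Bureau's actual selection procedure.

```python
import random


def pps_probabilities(populations):
    """Selection probability of each psu when one psu is drawn from the
    stratum with probability proportional to population size."""
    total = sum(populations.values())
    return {psu: pop / total for psu, pop in populations.items()}


def select_nsr_psu(populations, rng=random):
    """Draw one psu from the stratum with probability proportional to size."""
    names = list(populations)
    sizes = [populations[name] for name in names]
    return rng.choices(names, weights=sizes, k=1)[0]


# Hypothetical stratum of three psus.
stratum = {"psu_A": 100_000, "psu_B": 50_000, "psu_C": 500_000}
probs = pps_probabilities(stratum)
chosen = select_nsr_psu(stratum, random.Random(2000))
```

In this toy stratum, `psu_A` has twice the selection probability of `psu_B`, matching the 100,000 versus 50,000 comparison in the text.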

² Many structural features of the design are the same for more recent years, although there has been a 
drastic reduction in sample size. In 2006, the NCVS selected new psus based on the 2000 Census. Other 
designs are being considered for the NCVS starting after 2010 (National Research Council, 2008). 

TABLE 7.3 
Sampling Stages for the 2000 NCVS 

Stage   Sampling Unit                             Stratification 
1       psu (county, set of adjacent counties,    Location, demographic information, 
        or metropolitan area)                     and crime-related characteristics 
2       Enumeration District 
3       Cluster of 4 housing units 
4       Household 
5       Person within household 

The second stage of sampling involves selecting enumeration districts (EDs), 
geographic areas used in the decennial census; an ED typically contains about 300 to 
400 households (750 to 1500 persons), but EDs vary considerably in population and 
land area. The EDs are selected with probability proportional to their 1990 census 
population size; the number of EDs selected within a psu is determined so that 
the sample of EDs will be approximately self-weighting. In the census listing, EDs 
are arranged by geographic location; EDs are selected using systematic sampling as 
described in Section 6.2, so that the sampled EDs will be distributed geographically 
over the selected psu. If the overall sampling rate is 1/x, in SR psus the sampling 
interval is x. If using census records for the sampling frame, the addresses are numbered 
from 1 to the number of households in the psu. A random number k is chosen 
between 1 and x, and the EDs chosen to be in the sample are the ones containing 
addresses k, k + x, k + 2x, etc. In NSR psus, the sampling interval is (probability psu 
is selected)(x). 

In the third stage of sampling, each selected ED is divided into clusters of approx- 
imately four housing units each. The census lists housing units within an ED in 
geographic order, and when possible, that listing is used. A sample of those clusters 
is taken, and each housing unit in a selected cluster of about four housing units is 
included in the sample. All persons aged 12 and over in the housing unit are to be 
interviewed for the survey. 

The census listings are supplemented by area sampling. If the census listing 
of housing units were the only one used throughout that decade, there would be 
substantial undercoverage of the population, since no newly built housing units would 
be included in the sample. To allow new housing units to be included in the sample, 
the NCVS uses a sample of building permits for residential units and samples those. 
In area sampling, a field representative lists all housing units or other living quarters 
within a selected area of an ED, and that listing then serves as the sampling frame for 
that area. 

In summary, the stages for the NCVS are shown in Table 7.3. Interviews for the 
NCVS with persons aged 12 and over are taken every month, with the housing units 
selected for the sample covered in a six-month period—this allows the interviewing 
workload to be distributed evenly throughout the year. To allow for longitudinal 
analyses of the data, and to be relatively certain that crimes reported for a six-month 
period occurred during those six months and not during an earlier time, the residents 
of each housing unit are interviewed every six months over a three-year period, for 
a total of seven interviews. For 2000, about 43,000 housing units were interviewed. 
Altogether, about 80,000 persons gave responses to the questionnaire. 

Clearly, this is a complex survey design, and weights are used to calculate estimates 
of victimization rates and total numbers of crimes. The survey is designed to be 
approximately self-weighting, so initially each individual is assigned the same base 
weight of (1/probability of housing unit selection). 

The NCVS is designed to be self-weighting, but sometimes a selected cluster 
within an Enumeration District has more housing units than originally thought; for 
example, an apartment building might have been erected in place of detached hous- 
ing units. Then only housing units in a subsample of the cluster are interviewed. If 
subsampling is used, the units subsampled are assigned a weighting control factor 
(WCF). If only one-third are sampled, for instance, the sampled units are assigned a 
WCF of 3, because they will represent three times as many units. If a housing unit is 
in a cluster in which subsampling is not needed, it is assigned a WCF of 1. At this 
level, a sampled housing unit represents 


base weight x WCF 


housing units in the population. This is the sampling weight for a housing unit sampled 
in the NCVS; as the survey attempts to interview all persons aged 12 and older in the 
sampled housing units, the sampling weight for a person in the sample is set equal to 
the weight for the housing unit. 

All other weighting adjustments in the NCVS adjust for nonresponse, or are used 
in poststratification. Some persons selected to be in the sample are not interviewed 
because they are absent or refuse to participate. The interviewer gathers demographic 
information on the nonrespondents, and that demographic information is used to 
adjust the weights in an attempt to counteract the nonresponse. (This is an exam- 
ple of weighting class adjustments for nonresponse, as discussed in Section 8.5.) 
Two different weighting adjustments for nonresponse are used: the within-household 
noninterview adjustment factor (WHHNAF), and the household noninterview adjustment 
factor (HHNAF). In each adjustment factor, the goal is to increase weights of 
interviewed units that are most similar to units that cannot be interviewed. 

The WHHNAF is used to compensate for individual nonrespondents in households 
in which at least one member responded to the survey. It is computed separately for 
each of the regions (Northeast, Midwest, South, and West) of the United States. Within 
each region, the persons from households in which there was at least one respondent 
are classified into 24 cells using the race of the person designated as reference person, 
the age and sex of the nonresponding household member, and the nonrespondent’s 
relationship to the reference person. Any of the 24 cells that contain fewer than 30 
interviewed cases, or that produce a WHHNAF of two or more, are combined with 
similar cells; the collapsing of cells prevents some individuals from having weights 
that are too large. Then 


$$\text{WHHNAF} = \frac{\text{sum of weights of all persons in cell}}{\text{sum of weights of all interviewed persons in cell}}.$$

The weights used to calculate the WHHNAF are the weights assigned to this point 
in the weighting procedure, that is, (base weight) × (WCF). Thus the weights of 
respondents in a cell are increased so that they represent the nonrespondents, and the 
persons in the population that the nonrespondent would represent, in addition to their 
original representation. After applying the WHHNAF, the weight for an individual is 


base weight × WCF × WHHNAF. 
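
A weighting-class adjustment of this kind can be sketched in a few lines of Python. The record layout (dictionaries with `cell`, `weight`, and `interviewed` keys), the function names, and the toy records are assumptions for illustration; the collapsing check uses the two thresholds stated in the text (fewer than 30 interviewed cases, or a factor of two or more).

```python
from collections import defaultdict


def noninterview_factors(records):
    """Compute, for each cell, (sum of weights of all persons) divided by
    (sum of weights of interviewed persons)."""
    total = defaultdict(float)
    responded = defaultdict(float)
    for rec in records:
        total[rec["cell"]] += rec["weight"]
        if rec["interviewed"]:
            responded[rec["cell"]] += rec["weight"]
    # Cells with no interviewed persons have no factor and must be collapsed.
    return {cell: total[cell] / responded[cell]
            for cell in total if responded[cell] > 0}


def needs_collapsing(factor, n_interviewed):
    """True if the cell should be combined with a similar cell."""
    return n_interviewed < 30 or factor >= 2


# Toy records: weights here stand for base weight x WCF.
records = [
    {"cell": "A", "weight": 2000.0, "interviewed": True},
    {"cell": "A", "weight": 2000.0, "interviewed": True},
    {"cell": "A", "weight": 2000.0, "interviewed": False},
    {"cell": "B", "weight": 1500.0, "interviewed": True},
    {"cell": "B", "weight": 1500.0, "interviewed": True},
]
factors = noninterview_factors(records)
```

In cell A one of three equally weighted persons was not interviewed, so each respondent's weight is multiplied by 1.5; cell B has full response and a factor of 1.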


Not all nonresponse is from nonresponding individuals in responding households. 
About 3 to 4 percent of households are eligible for the survey but cannot be reached 
or refuse to respond; the household noninterview adjustment factor is used to attempt 
to compensate for nonresponse at the household level. For the HHNAF, households 
are grouped into cells by race of the reference person and metropolitan area and 
urban/suburban/rural status of the residence. Then 


$$\text{HHNAF} = \frac{\text{sum of weights of all persons in cell}}{\text{sum of weights of all interviewed persons in cell}}.$$


As with the WHHNAF, the weights used in calculating the HHNAF are the weights 
calculated so far: (base weight) x (WCF) x (WHHNAF). Cells are combined until 
the HHNAF is less than two. 

At this point in the construction of the weights, the weight assigned to an 
individual is 


base weight x WCF x WHHNAF x HHNAF. 


The sampling weights for responding individuals have been increased so that they 
also represent nonrespondents who are demographically similar. 

Because the NCVS is a sample, the demographic information in the sample usually 
differs from that of the U.S. population as a whole. Two stages of ratio estimation are 
used to adjust the sample values so they agree better with updated census information. 
This adjustment is expected to reduce the variance of estimates of victimization rates. 

The first stage of ratio estimation is used in NSR psus only, and is intended to 
reduce the variability that results from using one psu to represent the stratum. Ratio 
estimation is used to assign different weights to cells stratified by region, MSA status, 
and race. The first-stage factor, 


$$\text{FSF} = \frac{\text{independent count of number of persons in cell}}{\text{sample estimate (sum of weights) of the number of persons in cell}},$$


adjusts for differences between census characteristics of sampled NSR psus and char- 
acteristics of the full set of NSR psus. The FSF equals one for SR psus, and is truncated 
at 1.3 for NSR psus. 

The second stage of ratio estimation is applied to everyone in the sample. The 
persons in the sample are classified into 72 groups on the basis of their age, race, and 
sex. Cells need to have a count of at least 30 interviewed persons, and the SSF needs 
to be between 0.5 and 2.0; cells are collapsed until these conditions are met. 


$$\text{SSF} = \frac{\text{independent count of number of persons in cell}}{\text{sample estimate (sum of weights) of the number of persons in cell}}.$$


The second-stage factor is a form of poststratification: it is intended to adjust the 
sample distribution of age, race, and sex so that the cross-classification agrees with 
independently taken counts that are thought to be more accurate. If the sum of weights 
of elderly white women in the sample is larger than the current “best” estimate of the 
number of elderly white women in the population from updated census information, 
then SSF will be less than one for all elderly white women in the sample. 

FIGURE 7.22 
Boxplots of weights for the 2000 NCVS, for all persons, white males, white females, non- 
white males, nonwhite females, persons under age 25, persons over age 25, and victims of 
violent crime. The horizontal lines represent the 95th percentile, 75th percentile, median, 25th 
percentile, 5th percentile, and minimum. Note that the weights are much higher for nonwhite 
males, indicating the higher nonresponse and undercoverage in that group. 
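
The two ratio adjustments (FSF and SSF) share the same form, an independent count divided by the weighted sample count for the cell. The Python sketch below shows that ratio together with the rules quoted in the text: FSF equal to one in SR psus and truncated at 1.3 in NSR psus, and SSF cells collapsed when they have fewer than 30 interviewed persons or a factor outside 0.5 to 2.0. Function names and the counts used are hypothetical.

```python
def ratio_factor(independent_count, weighted_sample_count):
    """Independent count divided by the sum of weights in the cell."""
    return independent_count / weighted_sample_count


def first_stage_factor(independent_count, weighted_sample_count,
                       self_representing, cap=1.3):
    """FSF: one for SR psus, the ratio truncated at cap for NSR psus."""
    if self_representing:
        return 1.0
    return min(ratio_factor(independent_count, weighted_sample_count), cap)


def ssf_cell_needs_collapsing(ssf, n_interviewed):
    """True if the second-stage cell violates the size or range condition."""
    return n_interviewed < 30 or not (0.5 <= ssf <= 2.0)


# Hypothetical cell: the sample weights sum to more than the independent count.
ssf = ratio_factor(independent_count=900_000, weighted_sample_count=1_000_000)
```

Because the weighted sample count exceeds the independent count in this made-up cell, `ssf` is below one and every weight in the cell is scaled down.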

After all the adjustments, the final weight for person i is 


$$w_i = \text{base weight} \times \text{WCF} \times \text{WHHNAF} \times \text{HHNAF} \times \text{FSF} \times \text{SSF}.$$

The weight $w_i$ is used as though there were actually $w_i$ persons in the population 
exactly like the one to which the weight is attached. In the 2000 NCVS, the person 
weights range from 1100 to 9000, with most weights between 1500 and 2500. 
Figure 7.22 gives boxplots for the weights for persons interviewed between July and 
December of 2000. The weights are included on the public use data files of the NCVS: 
To use them to estimate the total number of aggravated assaults reported by white 
females, you would define 

$$y_i = \begin{cases} 1 & \text{if person } i \text{ is a white female who reported an aggravated assault} \\ 0 & \text{otherwise} \end{cases}$$

and use $\sum_{i \in \mathcal{S}} w_i y_i$ as your estimate. 
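
The final weight and the weighted total can be written out directly. In the Python sketch below, the function names and all numeric values are invented for illustration; they are not NCVS data.

```python
def final_weight(base_weight, wcf=1.0, whhnaf=1.0, hhnaf=1.0, fsf=1.0, ssf=1.0):
    """Product of the base weight and the five adjustment factors."""
    return base_weight * wcf * whhnaf * hhnaf * fsf * ssf


def weighted_total(weights, indicators):
    """Estimate a population total as the sum of w_i * y_i over the sample."""
    return sum(w * y for w, y in zip(weights, indicators))


# A subsampled housing unit (WCF = 3) in a cell with some nonresponse.
w_example = final_weight(1500, wcf=3, whhnaf=1.2)

# Three hypothetical respondents; y_i = 1 marks membership in the domain.
weights = [2000, 1500, 2500]
y = [1, 0, 1]
total_hat = weighted_total(weights, y)
```

The first and third respondents are counted as 2000 and 2500 persons respectively, so the estimated total for the domain is 4500.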

Even though the nonresponse is relatively low in the NCVS, the weights make 
a difference in calculating victimization rates. Estimates of victimization rates are 
generally higher when weights are used than when they are not used. Young black 
male respondents to the survey are disproportionately likely to be victims of crime, 
and undercoverage and nonresponse among black males are high. Figure 7.23 shows the 
difference that using weights makes in a histogram of ages of victims of violent crime. 

FIGURE 7.23 
Histograms of victim ages, without and with weights. The histogram constructed using weights 
has a higher frequency of victims in the younger age groups. 

The sampling design and the weighting scheme are complicated in the NCVS, so 
variance estimation is also complicated. Variances are now calculated by replication 
methods, described in Chapter 9. To protect confidentiality of respondents, the Bureau 
of Justice Statistics does not release the actual strata and psu variables in the public-use 
data sets; instead, it creates pseudo-strata and pseudo-psu variables that can be used 
to estimate variances. For variance estimation, we treat the psus as though they are 
sampled with replacement, so the subsampling information is not needed. The overall 
design effect for the NCVS, and for similar U.S. government surveys, is about two. 


Sampling and Design of Experiments* 


Numerous parallels between sample surveys and designed experiments are discussed 
in Fienberg and Tanur (1987) and Yates (1981). Some of these parallels are noted in 
this section. 

Simple random sampling, in which the universe $\mathcal{U}$ has N units, is similar to 
the randomization approach to the comparison of two treatments using a total of N 
experimental units. To test the hypothesis $H_0: \mu_1 = \mu_2$, randomly assign n of the N 
units to treatment 1 and the remaining $N - n$ units to treatment 2. The observed 
value of the test statistic is compared with the reference distribution based on all 
possible assignments of experimental units to treatments. The p-value comes 
from the randomization distribution. Using randomization for inference dates back 
to Fisher (1925), and the theory is developed in Kempthorne (1952). 
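
For a small experiment, the randomization distribution can be enumerated exactly. The Python sketch below is one way to do this, under the assumption that the test statistic is the absolute difference in treatment means and that, under $H_0$, each unit's response would have been the same under either treatment. The function names and the six response values are made up.

```python
from itertools import combinations


def mean_difference(y, treated):
    """Mean of the units in `treated` minus the mean of the remaining units."""
    treated = set(treated)
    group1 = [v for i, v in enumerate(y) if i in treated]
    group2 = [v for i, v in enumerate(y) if i not in treated]
    return sum(group1) / len(group1) - sum(group2) / len(group2)


def randomization_p_value(y, treated):
    """Two-sided p-value from all possible assignments of len(treated)
    units to treatment 1."""
    observed = abs(mean_difference(y, treated))
    diffs = [abs(mean_difference(y, assignment))
             for assignment in combinations(range(len(y)), len(treated))]
    # A small tolerance guards against floating-point ties.
    n_extreme = sum(1 for d in diffs if d >= observed - 1e-12)
    return n_extreme / len(diffs)


# Hypothetical responses; units 3, 4, 5 received treatment 1.
responses = [1, 2, 3, 10, 11, 12]
p_value = randomization_p_value(responses, treated=[3, 4, 5])
```

There are 20 ways to assign three of the six units to treatment 1, and only two of them give a difference as large as the one observed, so the p-value is 2/20. For larger experiments a random sample of assignments is usually drawn instead of enumerating them all.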

Randomization serves similar purposes in sampling and in designing experiments. 
In sampling, the goal is to be able to generalize our results to the whole population, 
and we hope that randomization gives us a representative sample. When we design 
an experiment, we attempt to “randomize out” all other possible influences, and we 
hope that we can separate the differences due to the treatments from random error. 
In both cases, we can quantify how often we expect to have a sample or a design that 
gives us a “bad” result. This quantification appears in CIs: 95% of possible samples 
or possible replications of an experiment are expected to yield a 95% CI that contains 
the true value. 

The purpose of stratification is to increase the precision of our estimates by group- 
ing similar items together. The same purpose is met in design of experiments with 
blocking. Cluster samples also group similar items together, but the purpose is con- 
venience, not precision. An analogue in design of experiments is a split-plot design, 
which generally gives greater precision in the subplot estimate than in the whole plot 
estimate. 

The structural similarity between surveys and designed experiments was exploited 
by using ANOVA tables to develop the theory of stratification and cluster sampling. We 
used a fixed-effects one-way ANOVA for a model-based approach to stratification and 
a random-effects one-way ANOVA for a model-based approach to cluster sampling. 
Much of the theory in cluster sampling is similar to the theory of random-effects 
models; in the models in Chapters 5 and 6, we relied on variance components to 
explain the dependence in the data. 

Poststratification and ratio and regression estimation in sampling allow us to 
increase the precision of estimators by taking advantage of the relationship between 
the variable of interest and other classification variables; the same goal in designed 
experiments is met by using covariate adjustment, as in analysis of covariance. 

Both design of experiments and sampling are involved in similar debates between 
using a randomization theory approach or using a model-based approach. We have 
touched on the different philosophical approaches for estimating functions of totals 
in Sections 2.9, 3.6, 4.6, 5.6, and 6.7, but much more has been said. We encourage the 
interested reader to start with the discussion papers by Smith (1994) and Hansen et al. 
(1983). Royall (1992a) succinctly summarizes a model-based approach to sampling. 
Brewer (2002) discusses implications of different philosophies of inference in survey 
sampling. 

Finally, in both sample surveys and designed experiments, it is crucial that ade- 
quate effort be spent on the design of the study. No amount of statistical analysis, 
however sophisticated, can compensate for a poor design. Chapter 1 presented examples 
of disastrous results from selection bias resulting from poor survey design or 
execution. A call-in poll is not only useless for generalizing to a population but also 
harmful, as people may believe its statistics are accurate. Similarly, little can be con- 
cluded about the efficacy of treatments A and B for a medical condition if the most ill 
patients are assigned to treatment A; if the mean duration of symptoms is significantly 
less for treatment B than for treatment A, is the difference due to the treatment or to 
the difference in the patients? 

Of course, it is sometimes possible to adjust for an imperfect design in the analysis. 
If a measure of the severity of the illness at the beginning of the study is available, it 
could be used as a covariate in comparing the two treatments, although there will still 
be worries about confounding with other, unmeasured quantities. Values for missing 
cells in a two-way ANOVA design can be estimated by a model. Similarly, available 
information about nonrespondents can be used to improve estimation in the presence 
of nonresponse, as discussed in the next chapter. 


Chapter Summary 


Many large surveys have a stratified multistage sampling design, in which the psus 
are selected by stratified sampling and then subsampled. Estimators of population 
quantities from a stratified multistage sample are calculated by combining the prin- 
ciples from Chapters 2-6. In most instances, only the stratification and information 
from the first stage of clustering are used to calculate standard errors of estimates. 

Any population quantity can be estimated from the sample using the weights. 
The empirical cdf and the empirical probability mass function estimate the cdf and 
probability mass function of the population by incorporating the weights $w_i$. Since $w_i$ 
can be thought of as the number of observation units in the population represented by 
observation unit $i$ in the sample, the empirical cdf can be thought of as the observed 
cdf of a pseudo-population in which observation $i$ in the sample is replicated $w_i$ times. 
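
The pseudo-population interpretation can be checked numerically. The Python sketch below computes the weighted empirical cdf and compares it with the ordinary cdf of a data set in which each value is repeated $w_i$ times; the function names and the three-observation sample are hypothetical, and the replication step assumes integer weights.

```python
def empirical_cdf(y, w, value):
    """Weighted empirical cdf: share of the total weight with y_i <= value."""
    return sum(wi for yi, wi in zip(y, w) if yi <= value) / sum(w)


def pseudo_population(y, w):
    """Replicate each y_i exactly w_i times (integer weights only)."""
    population = []
    for yi, wi in zip(y, w):
        population.extend([yi] * wi)
    return population


# Hypothetical sample with integer weights.
y = [1, 2, 3]
w = [4, 16, 20]
f_hat = empirical_cdf(y, w, value=2)

pop = pseudo_population(y, w)
f_pseudo = sum(1 for v in pop if v <= 2) / len(pop)
```

Both calculations give the same value, since 20 of the 40 units in the pseudo-population have a value of 2 or less.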

Graphs that are commonly used for displaying data from an SRS can be adapted for 
complex survey data by incorporating the survey weights. Histograms, boxplots, and 
scatterplots that use the survey weights display features of the data that are sometimes 
obscured in an analysis that only reports summary statistics. 

Although the survey weights can be used to find a point estimate of any population 
quantity, the weights are not sufficient information to calculate standard errors of 
statistics. Standard errors depend on the stratification and clustering in the survey 
design. The design effect, which is the ratio of the variance of a statistic calculated 
using the complex survey design to the variance that would have been obtained from 
an SRS with the same number of observation units, is useful for assessing the effect 
of design features on the variance. The design effect is often used to determine the 
sample size needed for a complex survey. 


Key Terms 


Design effect: Ratio of the variance of an estimator from the sampling plan to the 
variance of an estimator from an SRS with the same number of observation units. 
Empirical probability mass function: An estimator of the probability mass function 
using sampling weights: $\hat{f}(y) = \sum_{i \in \mathcal{S}:\, y_i = y} w_i \Big/ \sum_{i \in \mathcal{S}} w_i$. 

Probability mass function: f(y)=(number of observation units in the population 
whose value is y)/N gives the distribution of the finite population. 

Stratified multistage sample: A sampling design in which primary sampling units 
are grouped into strata; a probability sample is taken of the psus in each stratum. 
Secondary sampling units are then subsampled within each selected psu. In some 
cases, the selected ssus are also clusters and are themselves subsampled. 

For Further Reading 


The books edited by Skinner et al. (1989) and Chambers and Skinner (2003) are 
good places to start your reading about complex surveys. The volumes edited by 
Pfeffermann and Rao (2009a, 2009b) give a wealth of information on current topics 
in survey sampling. Two papers by Kish (1992, 1995) further explain the ideas behind 
weighting and design effects. The idea of using design effects for sample size esti- 
mation was introduced by Cornfield (1951); the paper gives an interesting example 
of sampling in practice. 

Korn and Graubard (1999) present the theory of sampling with application to the 
special problems involved in health surveys. They also emphasize plotting data from 
surveys, and describe a number of methods for constructing scatterplots and other 
plots with survey data. 


A. Introductory Exercises 


You are asked to design a survey to estimate the total number of cars without permits 
that park in handicapped parking places on your campus. What variables (if any) 
would you consider for stratification? For clustering? What information do you need 
to aid in the design of the survey? Describe a survey design that you think would 
work well for this situation. 


Repeat Exercise 1 for a survey to estimate the total number of books in a library that 
need rebinding. 


Repeat Exercise 1 for a survey to estimate the percentage of persons in your city who 
speak more than one language. 


Repeat Exercise 1 for a survey to estimate the distribution of number of eggs laid by 
Canada geese. 


The organization “Women tired of waiting in line” wants to estimate statistics about 
restroom usage. Design a survey to estimate the average amount of time spent 
by women in a restroom and the average time spent by men in a restroom at your 
university. 


Use the data in file integerwt.dat for this exercise. The strata are constructed with 
$N_1 = 200$, $N_2 = 800$, $N_3 = 400$, $N_4 = 600$. 

a Take a stratified random sample with $n_1 = 50$, $n_2 = 50$, $n_3 = 20$, and $n_4 = 25$. 
Calculate the sampling weight $w_i$ for each observation in your sample (the sample 
sizes were selected so that each weight is an integer). 


b Using the weights, estimate $\bar{y}_U$, $S^2$, and the 25th, 50th, and 75th percentiles of the 
population. 


c Now create a “pseudo-population” by constructing a data set in which the data 
value $y_i$ is replicated $w_i$ times. Your pseudo-population should have N = 2000 
observations. Estimate the same quantities in (b) using the pseudo-population 
and usual formulas for an SRS. How do the estimates compare with the estimates 
from (b)? 


B. Working with Survey Data 


Using the data in nybight.dat (see Exercise 18 of Chapter 3), find the empirical mass 
function of number of species caught per trawl in 1974. Be sure to use the sampling 
weights. 


Using the data in teachers.dat (see Exercise 15 of Chapter 5), use the sampling weights 
to find the empirical mass function of the number of hours worked. What is the design 
effect? 


Using the data in measles.dat (see Exercise 16 of Chapter 5), what is the design effect 
for percentage of parents who received a consent form? For the percentage of children 
who had previously had measles? 


Using the data in file statepop.dat (see Example 6.5 of Chapter 6), draw a histogram, 
using the weights, of the number of veterans. How does this compare with a histogram 
that does not use the weights? 


Using the data in file statepop.dat (see Example 6.5 of Chapter 6), draw one of the 
scatterplots, using the weights, of the number of veterans vs. number of physicians. 


The Survey of Youth in Custody sampled youth who were residents of long-term 
facilities at the end of 1987. Is the sample representative of youth who have been in 
long-term facilities in 1987? Why, or why not? 


The file syc.dat, used in Example 7.7 contains other information from the 1987 Survey 
of Youth in Custody. Draw a histogram, using the weights, for the age of the youth at 
first arrest. What is the average age of first arrest? The median? The 25th percentile? 
Use the “final weight” to estimate these quantities. How do your estimates compare 
to estimates obtained without using weights? 


Using the file syc.dat and the final weights, estimate the proportion of youths who 


a are age 14 or younger 

b are held for a violent offense 

c lived with both parents when growing up 

d are male 

e are Hispanic 

f grew up primarily in a single-parent family 


g have used illegal drugs. 
Give 95% CIs for your answers. 


Use the data in file nhanes.dat for this exercise. (If you prefer, you may download 
the NHANES data from the website at www.cdc.gov/nchs/nhanes.htm.) Triceps skin- 
fold measurements are sometimes used as a gauge of body fat. We are interested in 
the relation between y = triceps skinfold (variable bmxtri) and x = body mass index 
(variable bmxbmi). 

a Estimate the mean value of triceps skinfold for the population, along with a 
95% CI. 

b Draw a histogram of the variable triceps skinfold, using the weights. Do the data 
appear to be normally distributed? 

c Find the minimum, 25th, 50th, and 75th percentiles, and maximum for the variable 
triceps skinfold. Calculate the same quantities separately for each gender (variable 
riagendr). Use these to construct side-by-side skeletal boxplots of the data as in 
Figure 7.7. 

d Construct a weighted circle plot with smoothed trend line for y variable triceps 
skinfold and x variable body mass index. Does there appear to be a linear rela- 
tionship? What other features do you see in the data? 


Answer the questions in Exercise 15, for y = waist circumference (variable bmxwaist) 
and x = thigh circumference (variable bmxthicr). 


The file ncvs2000.dat includes selected variables for a subset of data in the 2000 
NCVS. Using the data, find estimates of the following: 

a percentage of persons who are victims of a violent crime 

b percentage of persons who have been injured in a violent crime 

c average number of crime incidents per person 

d average medical expenses for persons who are injured. 


Give standard errors for your estimates. 


C. Working with Theory 


Trimmed means. Many statisticians recommend using trimmed means to estimate a 
population mean $\bar{y}_U$ if there are outliers. The procedure used to find an $\alpha$-trimmed 
mean in an SRS of size n is to remove the largest $n\alpha$ observations and the smallest $n\alpha$ 
observations, and then calculate the mean of the $n(1 - 2\alpha)$ observations that remain. 

Show that the $\alpha$-trimmed mean for a finite population $\mathcal{U}$ of N observation units is 

$$\bar{y}_{U\alpha} = \frac{\sum_{q_1 \le y \le q_2} y\, f(y)}{\sum_{q_1 \le y \le q_2} f(y)},$$

where $q_1$ and $q_2$ are the $\alpha$ and $(1 - \alpha)$ quantiles, respectively. Now propose an estimator 
of the population $\alpha$-trimmed mean for data from a complex survey using $\hat{F}(y)$ and 
$\hat{f}(y)$. 


Probability-probability plots. A probability—probability plot (often referred to as a 
p-p plot) compares the empirical cdf from a sample with a specified theoretical 
cdf G such as the cdf of a normal distribution with specified mean and variance 
(Gnanadesikan, 1997). If the proposed cdf G describes the data well (including the 
specification of the mean and variance), the points in a p-p plot of $\hat{F}(y)$ vs. G(y) will 
lie approximately on a straight line with intercept 0 and slope 1. 

Construct a p-p plot for the height data in htstrat.dat, used in Example 7.3. Use a 
normal distribution for G, with the mean and standard deviation estimated from the 
sample. Draw in the line with intercept 0 and slope 1. Is G a reasonable distribution 
to use to summarize the data? 
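
The points of a p-p plot can be computed along the following lines. This is a minimal sketch, not the book's SAS code: the heights and weights are made-up values rather than htstrat.dat, the function names are hypothetical, and the variance is estimated with the weighted form sum(w(y - mean)^2)/(sum(w) - 1), which is an assumption about which estimator to use.

```python
import math
from statistics import NormalDist

def weighted_cdf(ys, ws):
    """Map each distinct y to F_hat(y): weight with y_i <= y over total weight."""
    total = sum(ws)
    mass = {}
    for y, w in zip(ys, ws):
        mass[y] = mass.get(y, 0) + w
    cdf, cum = {}, 0
    for y in sorted(mass):
        cum += mass[y]
        cdf[y] = cum / total
    return cdf

def pp_points(ys, ws):
    """Pairs (F_hat(y), G(y)) with G normal, mean and sd estimated with weights."""
    total = sum(ws)
    mean = sum(w * y for y, w in zip(ys, ws)) / total
    var = sum(w * (y - mean) ** 2 for y, w in zip(ys, ws)) / (total - 1)
    G = NormalDist(mean, math.sqrt(var))
    return [(F, G.cdf(y)) for y, F in weighted_cdf(ys, ws).items()]

# Hypothetical heights (cm) and sampling weights, for illustration only.
heights = [158, 162, 165, 165, 170, 174, 181]
wts = [20, 20, 20, 35, 35, 35, 35]

points = pp_points(heights, wts)
for F, Gy in points:
    print(round(F, 3), round(Gy, 3))
```

Plotting the second coordinate against the first, together with the line through (0, 0) and (1, 1), gives the p-p plot described above.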


Quantile—quantile plots. Quantile—quantile plots are often used to assess how well a 
theoretical probability distribution fits a data set (Chambers et al., 1983). To construct 
a quantile-quantile plot from an SRS of size n, order the sample values so that $y_{(1)} \le 
y_{(2)} \le \dots \le y_{(n)}$. Then, to compare with a continuous theoretical cdf G, calculate 
$x_{(i)} = G^{-1}[(i - 0.375)/(n + 0.25)]$ and plot $y_{(i)}$ vs. $x_{(i)}$ for $i = 1, \dots, n$. If G is a good 
fit for the data, the quantile—quantile plot will approximate a straight line. 
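
For an SRS, the plotting positions above can be computed as in this short sketch. It uses the standard-library `statistics.NormalDist` for G; the function name `qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist

def qq_points(sample, dist=NormalDist()):
    """Pairs (x_(i), y_(i)) with x_(i) = G^{-1}[(i - 0.375) / (n + 0.25)]."""
    ys = sorted(sample)
    n = len(ys)
    return [
        (dist.inv_cdf((i - 0.375) / (n + 0.25)), y)
        for i, y in enumerate(ys, start=1)
    ]

# Hypothetical SRS of five measurements, for illustration only.
sample = [171.2, 158.9, 165.0, 180.4, 162.3]

for x, y in qq_points(sample):
    print(round(x, 3), y)
```

The plot of y against x is then inspected for departures from a straight line.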

To use a quantile-quantile plot with survey data, let $w_{(1)}, \dots, w_{(n)}$ be the weight 
values corresponding to the ordered sample $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$. Let 


Then plot $y_{(i)}$ vs. $G^{-1}(v_{(i)})$ and assess whether the values appear to be approximately 
on a straight line. 

Figure 7.24 shows a histogram and quantile—quantile plot with G a standard normal 
cdf, for the body mass index data used in Section 7.4.2. SAS code used to produce these 
plots is on the website. The histogram displays a skewed distribution. The curvature in 
the quantile—quantile plot also indicates the skewness, since observations on the left 
are more compressed and observations on the right are more extreme than we would 
expect for normally distributed data. If a normal distribution described the data well, 
we would expect to see the points following a straight line. 


a Show that the plot of $y_{(i)}$ vs. $G^{-1}(v_{(i)})$ gives the SRS quantile-quantile plot when 
the sample is self-weighting. 

b Construct a quantile-quantile plot for the height data in htstrat.dat, used in Example 7.3. 
Use a standard normal cdf for G. Do you think the normal distribution 
describes these data well? 


Show that in a stratified sample, $\sum_y y \hat{f}(y)$ produces the estimator in (3.2). 


What is $\hat{S}^2$ in (7.3) for an SRS? How does it compare with the sample variance $s^2$? 


Consider a probability sample $\mathcal{S}$ of n observation units from a population $\mathcal{U}$ of N 
observation units. The weights are $w_i = 1/\pi_i$, where $\pi_i$ is the probability that unit i 
is in the sample. Now let $\mathcal{S}_2$ be a subsample of $\mathcal{S}$ of size $n_2$, with units selected with 
probability proportional to $w_i$. Show that $\mathcal{S}_2$ is a self-weighting sample from $\mathcal{U}$. 


FIGURE 7.24 
Histogram and normal quantile-quantile plot of body mass index from NHANES. 


In a two-stage cluster sample of rural and urban areas in Nepal, Rothenberg et al. 
(1985) found that the design effect for common contagious diseases was much higher 
than for rare contagious diseases. In the urban areas measles, with an estimated 
incidence of 123.9 cases per 1000 children per year, had a design effect of 7.8; diphtheria, 
with an estimated incidence of 2.1 cases per 1000 children per year, had a design effect 
of 1.9. 

Explain why one would expect this disparity in the design effects. (HINT: Suppose 
a sample of 1000 children is taken, in 50 clusters of 20 children each. Also suppose 
that the disease is as aggregated as possible, so if the estimated incidence were 40 
per 1000, all children in two clusters would have the disease, and no children in 
the remaining 48 clusters would have the disease. Now calculate deff for incidences 
varying from 1 per 1000 to 200 per 1000.) 
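
The calculation suggested in the hint can be organized as in the following sketch. It rests on assumptions that the exercise does not spell out: the 50 clusters are treated as an SRS of equal-sized clusters with no finite population correction, the cluster-sample variance is the sample variance of the cluster proportions divided by the number of clusters, and the SRS variance is taken as p(1 - p)/(n - 1). The function name `deff_max_aggregation` is hypothetical.

```python
def deff_max_aggregation(cases, n_clusters=50, cluster_size=20):
    """Design effect when `cases` diseased children are packed into as few
    clusters as possible (assumed variance formulas; see the text above)."""
    n = n_clusters * cluster_size
    if not 0 < cases < n:
        raise ValueError("cases must be strictly between 0 and n")

    # Fill whole clusters first; any remainder goes into one more cluster.
    full, rest = divmod(cases, cluster_size)
    counts = [cluster_size] * full + ([rest] if rest else [])
    counts += [0] * (n_clusters - len(counts))

    props = [c / cluster_size for c in counts]
    p = sum(props) / n_clusters

    # Variance of the estimated proportion under the cluster design.
    s2_between = sum((x - p) ** 2 for x in props) / (n_clusters - 1)
    v_cluster = s2_between / n_clusters

    # Variance of the estimated proportion under an SRS of n children.
    v_srs = p * (1 - p) / (n - 1)
    return v_cluster / v_srs

for cases in (1, 2, 5, 10, 20, 40, 100, 200):
    print(cases, round(deff_max_aggregation(cases), 2))
```

Under these assumptions the ratio is about 1 when there is a single case in the sample and about 20 once whole clusters are affected, which is the pattern the hint is pointing toward.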


The British Crime Survey (BCS) is also a stratified, multistage survey (AyeMaung, 
1995). In contrast to the NCVS, the BCS is not designed to be approximately self- 
weighting, as inner-city areas are sampled at about twice the rate of non-inner-city 
areas. In the BCS, households are selected using probability sampling, but only one 
adult (selected at random) is interviewed in each responding household. Set the relative 
sampling weight for an inner-city household to be 1. 


a Consider the BCS as a sample of households. What is the relative sampling weight 
for a non-inner-city household? 


b Consider the BCS as a sample of adults. Construct a table of relative sampling 
weights for the sample of adults. 


Number of Adults    Inner City    Not Inner City 
1 
2 
3 
4 
5 


D. Projects and Activities 


Obtain one of the papers listed in the file chapter7papers.html on the book website, or 
another paper employing a complex survey design, and write a short critique. Your 
critique should include: 

a a brief summary of the design and analysis 


b a discussion of the effectiveness of the design and the appropriateness of the 
analysis 


c your recommendations for future studies of this type. 


Trucks. The U.S. Vehicle Inventory and Use Survey (VIUS) was described in Exer- 
cise 34 of Chapter 3. 


a Draw a histogram, using the weights, of the number of miles driven (variable 
miles_annl) for the five truck class strata. 


b Draw side-by-side boxplots, using the weights, of miles per gallon (MPG) for 
each class of gross vehicle weight (vius_gvw). 

c Draw two of the scatterplots that incorporate weights, described in Section 7.4.2, 
for y variable miles_annl and x variable model year (adm_modelyear). How do 
these differ from scatterplots that do not use the weights? 


IPUMS exercises. 


a Use the file ipums.dat to select a two-stage stratified cluster sample from the 
population. Select two psus from each stratum, with probability proportional to 
size. Then take a simple random sample of persons from each selected psu; use 
the same subsampling size within each psu. Your final sample should have about 
1200 persons. 


b Construct the column of sampling weights for your sample. 

c Draw a histogram of the variable inctot for your sample, using the weights. 


d Construct side-by-side boxplots of the variable inctot for each level of marital 
status (variable marstat). 


e Draw two of the scatterplots that incorporate weights, described in Section 7.4.2, 
for y variable inctot and x variable age. How do these differ from scatterplots that 
do not use the weights? 


f Using the sample you selected, estimate the population mean of inctot and give 
the standard error of your estimate. Also estimate the population total of inctot 
and give its standard error. 


g Compare your results with those from an SRS with the same number of per- 
sons. Find the design effect of your response (the ratio of your variance from the 
unequal-probability sample to the variance from the SRS). 


Baseball data. Use the two-stage sample from Exercise 37 of Chapter 5 for this 
exercise. 


a Draw a histogram of the variable salary for your sample, using the weights. 

b Construct side-by-side boxplots of the variable salary for each position. 


c Draw two of the scatterplots that incorporate weights, described in Section 7.4.2, 
for y variable salary and x variable number of games played (g). How do these 
differ from scatterplots that do not use the weights? 


d Draw two of the scatterplots that incorporate weights, described in Section 7.4.2, 
for y variable salary and x variable number of home runs (hr). What do you see 
in these graphs? 

e Draw quantile-quantile plots (see Exercise 20) for the variable salary and 
log(salary). Does either variable appear to follow, approximately, a normal dis- 
tribution? 


Many governmental statistical organizations and other collectors of survey data have 
websites providing information on the survey design. Some of these organizations 
and their Internet addresses are listed in the file chapter7websites.html on the book 
website. The first site listed, www.fedstats.gov, provides links to U.S. government 
agencies that spend at least $500,000 per year on statistical activities. Many of these 
agencies conduct surveys. The second listing from the International Statistical Institute 
(ISI) provides a directory to official statistical agencies throughout the world. 

Look up a website describing a complex survey. Write a summary of the purpose, 
design, and method used for analysis. Do you think that the design used could be 
improved upon? If so, how? 


Activity for course project. Find a survey data set that has been collected by a federal 
government or large survey organization. Many of these are now available online, 
and contain information about stratification and clustering that you can use to calcu- 
late standard errors of survey estimates. Some examples in the United States include 
the NCVS, the National Health Interview Survey, the Current Population Survey, 
the Commercial Buildings Energy Consumption Survey, and the General Social Sur- 
vey. You can find a survey by selecting a topic from www.fedstats.gov and follow- 
ing the links to the survey data. Many of the other organizations listed in the file 
chap7websites.html on the book website (see Exercise 30) also provide survey data 
online. 

Read the documentation for the survey. What is the survey design? What stratifi- 
cation and clustering variables are used? (Sometimes the stratification and clustering 
variables are difficult to find in the documentation; look for variables containing “psu” 
or “str” in the name. These are often near the beginning or end of the variable listing 
in the codebook. Some surveys do not release stratification and clustering information 
to protect the confidentiality of data respondents, so make sure your survey provides 
that information.) 

Select response variables that you are interested in. If possible, find at least one 
response with continuous response. Draw a histogram, using the final weight variable, 
for that response. Use the weights to estimate the summary statistics of the mean and 
25th, 50th, and 75th percentiles. We’ll return to this data set in subsequent chapters 
so that you will have an opportunity to study bivariate and multivariate relationships 
among your variables of interest. 




Nonresponse 


Miss Schuster-Slatt said she thought English husbands were lovely, and that she was preparing a 
questionnaire to be circulated to the young men of the United Kingdom, with a view to finding out their 
matrimonial preferences. 

“But English people won't fill up questionnaires,” said Harriet. 

“Won't fill up questionnaires?” cried Miss Schuster-Slatt, taken aback. 

“No,” said Harriet, “they won't. As a nation we are not questionnaire-conscious.” 


—Dorothy Sayers, Gaudy Night 


The best way to deal with nonresponse is to prevent it. After nonresponse has occurred, 
it is sometimes possible to construct models to predict the missing data, but predict- 
ing the missing observations is never as good as observing them in the first place. 
Nonrespondents often differ in critical ways from respondents; if the nonresponse 
rate is not negligible, inference based only upon the respondents may be seriously 
flawed. 

We discuss two types of nonresponse in this chapter: unit nonresponse, in which 
the entire observation unit is missing, and item nonresponse, in which some measure- 
ments are present for the observation unit but at least one item is missing. In a survey 
of persons, unit nonresponse means that the person provides no information for the 
survey; item nonresponse means that the person does not respond to a particular item 
on the questionnaire. In the Current Population Survey (CPS) and the National Crime 
Victimization Survey (NCVS), unit nonresponse can arise for a variety of reasons: 
The interviewer may not be able to contact the household; the person may be ill and 
cannot respond to the survey; the person may refuse to participate in the survey. In 
these surveys, the interviewer tries to get demographic information about the non- 
respondent such as age, sex, and race, as well as characteristics of the dwelling unit 
such as urban/rural status; this information can be used later to try to adjust for the 
nonresponse. Item nonresponse occurs largely because of refusals: A household may 
decline to give information about income, for example. 

In agriculture or wildlife surveys, the term missing data is generally used instead 
of nonresponse, but the concepts and remedies are similar. In a survey of breeding 
ducks, for example, some birds will not be found by the researchers; they are, in a 
sense, nonrespondents. The nest may be raided by predators before the investigator 
can determine how many eggs were laid; this is comparable to item nonresponse. 
Lesser and Kalsbeek (1999) discuss nonresponse and other nonsampling errors in 
environmental surveys. 

In this chapter, we discuss four approaches to dealing with nonresponse: 


1 Prevent it. Design the survey so that nonresponse is low. This is by far the best 
method. 


2 Take a representative subsample of the nonrespondents; use that subsample to 
make inferences about the other nonrespondents. 


3 Use a model to predict values for the nonrespondents. Weighting class adjustment 
methods implicitly use a model to adjust for unit nonresponse. Imputation often 
adjusts for item nonresponse, and parametric models may be used for either type 
of nonresponse. 


4 Ignore the nonresponse (not recommended, but unfortunately common in 
practice). 


Effects of Ignoring Nonresponse 


EXAMPLE 8.1 


Thomsen and Siring (1983) report results from a 1969 survey on voting behavior 
carried out by the Central Bureau of Statistics in Norway. In this survey, three calls 
were followed by a mail survey. The final nonresponse rate was 9.9%, which is often 
considered to be a small nonresponse rate. Did the nonrespondents differ from the 
respondents? 

In the Norwegian voting register, it was possible to find out whether a person 
voted in the election. The percentage of persons who voted could then be compared 
for respondents and nonrespondents; Table 8.1 shows the results. The selected sample 
is all persons selected to be in the sample, including data from the Norwegian voting 
register for both respondents and nonrespondents. 

TABLE 8.1 
Percentage of Persons Who Voted 

                                   Age 
                    All   20-24   25-29   30-49   50-69   70-79 
Nonrespondents       71      59      56      72      78      74 
Selected sample      88      81      84      90      91      84 

Source: Adapted from table 8 in Thomsen and Siring (1983). 


The difference in voting rate between the nonrespondents and the selected 
sample was largest in the younger age groups. Among the nonrespondents, the 
voting rate varied with the type of nonresponse. The overall voting rate for the 
persons who refused to participate in the survey was 81%, the voting rate for 
the not-at-homes was 65%, and the voting rate for the mentally and physically ill 
was 55%, implying that absence or illness were the primary causes of nonresponse 
bias. ■ 


It has been demonstrated repeatedly that nonresponse can have large effects on 
the results of a survey—in Example 8.1, a nonresponse rate of less than 10% led to an 
overestimate of the voting rate in Norway. Holt and Elliot (1991, p. 334) discuss the 
results of a series of studies done on nonresponse in the United Kingdom indicating 
that “lower response rates are associated with the following characteristics: London 
residents; households with no car; single people; childless couples; older people; 
divorced/widowed people; new Commonwealth origin; lower educational attainment; 
self-employed.” 

Moreover, increasing the sample size without targeting nonresponse does noth- 
ing to reduce nonresponse bias; a larger sample size merely provides more obser- 
vations from the class of persons that would respond to the survey. Increasing the 
sample size may actually worsen the nonresponse bias, as the larger sample size 
may divert resources that could have been used to reduce or remedy the nonre- 
sponse, or it may result in less care in the data collection. Recall that the infamous 
Literary Digest Survey of 1936, discussed in Example 1.1, had 2.4 million respon- 
dents but a response rate of less than 25%. The U.S. decennial census itself does not 
include the entire population, and the undercoverage rate varies for different demo- 
graphic groups. Mulry (2004) discusses issues in measuring the undercoverage in the 
U.S. Census. 

Most small surveys ignore any nonresponse that remains after callbacks and 
follow-ups, and report results based on complete records only. Hite (1987) did so 
in the survey discussed in Chapter 1, and much of the criticism of her results was 
based on her low response rate. Nonresponse is also ignored for many surveys reported 
in newspapers, both local and national. 

An analysis of complete records has the underlying assumptions that the non- 
respondents are similar to the respondents, and that units with missing items are 
similar to units that have responses for every question. Much evidence indicates that 
this assumption does not hold true in practice. If nonresponse is ignored in the NCVS, 
for example, victimization rates are underestimated. Biderman and Cantor (1984) find 
lower victimization rates for persons who respond in three consecutive interviews than 
for persons who are nonrespondents in at least one of those interviews or who move 
before they complete the panel. 

Results reported from an analysis of only complete records should be taken 
as representative of the population of persons who would respond to the survey, 
which is rarely the same as the target population. If you insist on estimating 
population means and totals using only the complete records and making no adjust- 
ment for nonrespondents, at the very least you should report the rate of 
nonresponse. 

The main problem caused by nonresponse is potential bias. Think of the population 
as being divided into two somewhat artificial strata of respondents and nonrespondents. 
The population respondents are the units that would respond if they were 
chosen to be in the sample; the number of population respondents, $N_R$, is unknown. 
Similarly, the $N_M$ (M for missing) population nonrespondents are the units that would 
not respond. We then have the following population quantities: 


Stratum              Size     Total    Mean              Variance 
Respondents          $N_R$    $t_R$    $\bar{y}_{RU}$    $S_R^2$ 
Nonrespondents       $N_M$    $t_M$    $\bar{y}_{MU}$    $S_M^2$ 
Entire population    $N$      $t$      $\bar{y}_U$       $S^2$ 


The population as a whole has variance $S^2 = \sum_{i=1}^{N} (y_i - \bar{y}_U)^2/(N - 1)$, mean 
$\bar{y}_U$, and total t. A probability sample from the population will likely contain some 
respondents and some nonrespondents. But, of course, on the first call we do not 
observe $y_i$ for any of the units in the nonrespondent stratum. If the population mean 
in the nonrespondent stratum differs from that in the respondent stratum, estimating 
the population mean using only the respondents will produce bias.¹ 

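Let $\bar{y}_R$ be an approximately unbiased estimator of the mean in the respondent 
stratum, using only the respondents. As 

$$\bar{y}_U = \frac{N_R}{N}\,\bar{y}_{RU} + \frac{N_M}{N}\,\bar{y}_{MU},$$

the bias is approximately 

$$E[\bar{y}_R] - \bar{y}_U \approx \frac{N_M}{N}\,(\bar{y}_{RU} - \bar{y}_{MU}).$$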


The bias is small if either (1) the mean for the nonrespondents is close to the mean for 
the respondents, or (2) $N_M/N$ is small, that is, there is little nonresponse. But we can never 
be assured of (1), as we generally have no data for the nonrespondents. Minimizing 
the nonresponse rate is the only sure way to control nonresponse bias. 
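
To see the size of the bias in a concrete case, the sketch below plugs the "All" column of Table 8.1 into the bias expression. It is illustrative only: it treats the rounded sample percentages (88% voting in the selected sample, 71% among nonrespondents, 9.9% nonresponse) as if they were population quantities, and the function names are hypothetical.

```python
def respondent_mean(overall_mean, nonresp_mean, nonresp_rate):
    """Solve overall = (1 - rate) * resp + rate * nonresp for the respondent mean."""
    return (overall_mean - nonresp_rate * nonresp_mean) / (1 - nonresp_rate)

def nonresponse_bias(resp_mean, nonresp_mean, nonresp_rate):
    """Approximate bias of the respondent mean: (N_M / N) * (ybar_RU - ybar_MU)."""
    return nonresp_rate * (resp_mean - nonresp_mean)

# Values from Example 8.1, treated here as population quantities (an assumption).
rate = 0.099   # nonresponse rate
y_u = 0.88     # voting rate in the full selected sample
y_m = 0.71     # voting rate among nonrespondents

y_r = respondent_mean(y_u, y_m, rate)
bias = nonresponse_bias(y_r, y_m, rate)

print(round(y_r, 4), round(bias, 4))
```

Under these assumptions the respondents' voting rate comes out close to 90%, so an analysis of respondents alone would overstate the voting rate by a little under two percentage points even though fewer than 10% of the sample failed to respond.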


8.2 Designing Surveys to Reduce Nonsampling Errors 


A common feature of poor surveys is a lack of time spent on the design and nonre- 
sponse follow-up in the survey. Many persons new to surveys (and some, unfortu- 
nately, not new) simply jump in and start collecting data without considering potential 
problems in the data collection process; they mail questionnaires to everyone in the 
target population and analyze those that are returned. It is not surprising that such 
surveys have poor response rates. Some surveys reported in academic journals on 
purchasing, for example, have response rates between 10 and 15%. It is difficult to 
see how anything can be concluded about the population in such a survey. 


¹The variance is often too low as well. In income surveys, for example, the rich and the poor are more 
likely to be nonrespondents on the income questions. In that case, $S_R^2$, for the respondent stratum, is 
smaller than $S^2$. The point estimator of the mean may be biased, and the variance estimator may be 
biased, too. 


A researcher who knows the target population well will be able to anticipate some 
of the reasons for nonresponse and prevent some of it. Most investigators, however, 
do not know as much about reasons for nonresponse as they think they do. They 
need to discover why the nonresponse occurs and resolve as many of the problems as 
possible before commencing the survey. 

These reasons can be discovered through designed experiments and application of 
quality improvement methods to the data collection and processing. You do not know 
why previous surveys related to yours have a low response rate? Design an experiment 
to find out. You think errors are introduced in the data recording and processing? Use 
a nested design to find the sources of errors. Books on quality improvement or design 
of experiments such as Montgomery (2000) or Oehlert (2000) will tell you how to 
collect your data. 

And, of course, you can rely on previous researchers’ experiments to help you 
minimize nonsampling errors. The references on design of experiments and quality 
control in Chapter 15 are a good place to start; Hidiroglou et al. (1993) give a general 
framework for nonresponse. 


EXAMPLE 8.2 


The 1990 U.S. decennial census attempted to survey each of the over 100 million 
households in the United States. The response rate for the mail survey was 65%; 
households that did not mail in the questionnaire needed to be contacted in person, 
adding millions of dollars to the cost of the census. Increasing the mail response rate 
for future censuses would result in tremendous savings. 

Dillman et al. (1995) report results of a factorial experiment employed in the 
1992 Census Implementation Test, designed to explore the individual effects and 
interactions of three experimental factors on response rates. The three factors were: 
(1) a prenotice letter alerting the household to the impending arrival of the census 
form, (2) a stamped return envelope included with the census form, and (3) a reminder 
postcard sent a few days after the census form. The results were dramatic, as shown 
in Figure 8.1. The experiment established that while all three factors influenced the 
response rate, the letter and postcard led to greater gains in response rate than the 
stamped return envelope. ■ 


Nonresponse can have many different causes; as a result, no single method can 
be recommended for every survey. Platek (1977) classifies sources of nonresponse 
as related to (1) survey content, (2) methods of data collection, and (3) respondent 
characteristics, and illustrates various sources using the diagram in Figure 8.2. Groves 
(1989) and Dillman et al. (2009) discuss additional sources of nonresponse. The 
following are some factors that may influence response rate and data accuracy. 


■ Survey content. A survey on drug use or financial matters may have a large number 
of refusals. Sometimes the response rate can be increased for sensitive items by 
careful ordering of the questions, by using a randomized response technique (see 
Section 15.4), or by using a self-administered questionnaire on the computer to 
preserve the respondents’ privacy. 


■ Time of survey. Some calling periods or seasons of the year may yield higher 
response rates than others. The vacation month of August, for example, would be 
a bad time to take a one-time household survey in Germany. 




FIGURE 8.1 
Response rates achieved for each combination of the factors letter, envelope, and postcard. The 
observed response rate was 64.3% when all three aids were used and only 50% when none 
were used. 


FIGURE 8.2 
Factors affecting nonresponse 
Source: “Some Factors Affecting Non-Response,” by R. Platek, 1977, Survey Methodology, 3, 191-214. 
Copyright © 1977 Survey Methodology. Reprinted with permission. 


■ Interviewers. Gower (1979) found a large variability in response rates achieved 
by different interviewers, with about 15% of interviewers reporting almost no 
nonresponse. Some field investigators in a bird survey may be better at spotting 
and identifying birds than others. Quality improvement methods can be applied 
to increase the response rate and accuracy for interviewers. The same methods 
can be applied to the data-coding process. These methods will be discussed in 
Chapter 15. 


■ Data-collection method. Generally, telephone and mail surveys have a lower 
response rate than in-person surveys (they also have lower costs, however). 
Computer-Assisted Telephone Interviewing (CATI) and Computer-Assisted 
Personal Interviewing (CAPI) have been demonstrated to improve accuracy of 
data collected in telephone and in-person surveys; with CATI and CAPI, all ques- 
tions are displayed on a computer and the interviewer codes the responses in 
the computer as questions are asked. CATI and CAPI are especially helpful in 
surveys in which a respondent’s answer to one question determines which ques- 
tion is asked next (Catlin et al., 1988). Many telephone surveys have reported a 
decrease in response rates in recent years (Curtin et al., 2005); some of the decline 
in telephone response rates is attributed to recent technology that allows people 
to screen calls. 

Mail, fax, and Internet surveys often have low response rates. Possible reasons 
for nonresponse in a mail survey should be explored before the questionnaire is 
mailed: Is the survey sent to the wrong address? Do recipients discard the envelope 
as junk mail even before opening it? Will the survey reach the intended recipient? 
Will the recipient believe that filling out the survey is worth the time? 

A survey conducted by an interviewer often has less item nonresponse than a 
self-administered questionnaire (Tourangeau et al., 2000). A person filling out a 
paper survey can skip questions more easily than a person who is prompted by an 
interviewer. Computer-assisted self-administered questionnaires can sometimes 
be designed so participants must provide an answer to all questions; that does not 
mean the answers are always truthful, however. 


■ Questionnaire design. We have already seen in Chapter 1 that question wording 
has a large effect on the responses received; it can also affect whether a person 
responds to an item on the questionnaire. Beatty and Herrmann (2002) review 
research on the application of cognitive research to questionnaire design. In a 
mail or Internet survey, a well-designed form for the respondent may increase 
data accuracy and reduce item nonresponse (Dillman, 2008). 


■ Respondent burden. Persons who respond to a survey are doing you an immense 
favor, and the survey should be as nonintrusive as possible. A shorter questionnaire, 
requiring less detail, may reduce the burden to the respondent. Respondent 
burden is a special concern in panel surveys such as the NCVS, in which sampled 
households are interviewed every six months for 3½ years. DeVries et al. 
(1996) discuss methods used in reducing respondent burden in the Netherlands. 
Techniques such as stratification can reduce respondent burden because a smaller 
sample suffices to give the required precision. Raghunathan and Grizzle (1995) 
use a split-questionnaire design, in which subsets of the survey respondents are 
given different subsets of the questionnaire, to reduce respondent burden. With a 
split-questionnaire design, each individual receives a shortened questionnaire yet
every question is administered to at least a subsample of respondents. 


Survey introduction. The survey introduction provides the first contact between
the interviewer and potential respondent; a good introduction, giving the recipi- 
ent motivation to respond, can increase response rates dramatically. The Nielsen 
Company emphasizes to households in their selected sample that their participa- 
tion in the Nielsen ratings affects which television shows are aired. The respondent 
should be told for what purpose the data will be used (unscrupulous persons often 
pretend to be taking a survey when they are really trying to attract customers or 
converts), and assured of confidentiality.


Incentives and disincentives. Incentives, financial or otherwise, may increase the
response rate (Singer, 2002). Disincentives may work as well: Physicians who 
refused to be assessed by peers after selection in a stratified sample from the 
College of Physicians and Surgeons of Ontario registry had their medical licenses 
suspended. Not surprisingly, nonresponse was low (McAuley et al., 1990). 


Follow-up. The initial contact of the sample is usually less costly per unit than
follow-ups of the nonrespondents. If the initial survey is by mail, a reminder may 
increase the response rate. Not everyone responds to follow-up calls, though; some 
persons will refuse to respond to the survey no matter how often they are contacted. 
You need to decide how many follow-up calls to make before the marginal returns 
do not justify the money spent. 


You should try to obtain at least some information about nonrespondents that can 
be used later to adjust for the nonresponse, and include surrogate items that can be 
used for item nonresponse. True, there is no complete compensation for not having 
the data, but partial information may be better than none. Information about the race, 
sex, or age of a nonrespondent may be used later to adjust for nonresponse. Questions 
about income may well lead to refusals, but questions about cars, employment, or 
education may be answered and can be used to predict income. If the pretests of the 
survey indicate a nonresponse problem that you do not know how to prevent, try to 
design the survey so that at least some information is collected for each observation 
unit. 

The quality of survey data is largely determined at the design stage. Fisher’s 
(1938) words about experiments apply equally well to the design of sample surveys: 
“To call in the statistician after the experiment is done may be no more than asking 
him to perform a postmortem examination: he may be able to say what the experiment 
died of.” Any survey budget needs to allocate sufficient resources for survey design 
and for nonresponse follow-up. Do not scrimp on the survey design; every hour spent 
on design may save weeks of remorse later. 


Callbacks and Two-Phase Sampling 


Virtually all good surveys rely on callbacks to obtain responses from persons not at 
home for the first try. Analysis of callback data can provide some information about 
the biases that can be expected from the remaining nonrespondents. 


EXAMPLE 8.3 
Traugott (1987) analyzed callback data from two 1984 Michigan polls on preference
for presidential candidates. The overall response rates for the surveys were about 65%, 
typical for large political polls. About 21% of the interviewed sample responded on 
the first call; up to 30 attempts were made to reach persons who did not respond on 
the first call. Traugott found that later respondents were more likely to be male, older, 
and Republican than early respondents; while 48% of the respondents who answered 
the first call supported Reagan and 45% supported Mondale, 59% of the entire sample 
supported Reagan as opposed to 39% for Mondale. Differing procedures for nonre- 
sponse follow-up and persistence in callback may explain some of the inconsistencies 
among political polls. 

If nonrespondents resemble late respondents, one might speculate that nonre- 
spondents were more likely to favor Reagan. But nonrespondents do not necessarily 
resemble the hard-to-reach; persons who absolutely refuse to participate may differ 
greatly from persons who could not be contacted immediately, and nonrespondents 
may be more likely to have illnesses or other circumstances preventing participation. 
We also do not know how likely it is that nonrespondents to the surveys will vote in 
the election; even if we speculate that they were more likely to favor Reagan, they 
are not necessarily more likely to vote for Reagan. 


Often, when the survey is designed so that callbacks will be used, the initial contact 
is by mail survey; the follow-up calls use a more expensive method such as a personal 
interview. 

Hansen and Hurwitz (1946) proposed subsampling the nonrespondents and using 
two-phase sampling (also called double sampling) for stratification to estimate the 
population mean or total. The population is divided into two strata, as described in 
Section 8.1; the two strata are respondents and initial nonrespondents, persons who 
do not respond to the first call. We shall develop the theory of two-phase sampling 
for general survey designs in Chapter 12; here, we illustrate how it can be used for 
nonresponse. 

In the simplest form of two-phase sampling, randomly select $n$ units in the
population. Of these, $n_R$ respond and $n_M$ do not respond. The values $n_R$ and $n_M$, though, are
random variables; they will change if a different SRS is selected. Then make a second
call on a random subsample of $100\nu\%$ of the $n_M$ nonrespondents in the sample, where
the subsampling fraction $\nu$ does not depend on the data collected.

Suppose that through some superhuman effort all of the targeted nonrespondents
are reached. Let $\bar{y}_R$ be the sample average of the original respondents, and $\bar{y}_M$ ($M$
stands for “missing”) be the average of the subsampled nonrespondents. The
two-phase sampling estimators of the population mean and total are


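$$\hat{\bar{y}} = \frac{n_R}{n}\,\bar{y}_R + \frac{n_M}{n}\,\bar{y}_M \tag{8.1}$$

and

$$\hat{t} = N\hat{\bar{y}} = \frac{N}{n}\sum_{i \in S_R} y_i + \frac{N}{n}\,\frac{1}{\nu}\sum_{i \in S_M} y_i, \tag{8.2}$$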


where $S_R$ represents the sampled units in the respondent stratum and $S_M$ represents
the sampled units in the nonrespondent stratum. Note that $\hat{t}$ is a weighted sum of
the observed units; the weights are $N/n$ for the respondents and $N/(n\nu)$ for the
subsampled nonrespondents. Because only a subsample was taken in the nonrespondent
stratum, each subsampled unit in that stratum represents more units in the
population than does a unit in the respondent stratum.

The expected value and variance of these estimators are given in Chapter 12. 
From (12.8), if the finite population corrections can be ignored, we can estimate the 
variance by 


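$$\hat{V}(\hat{\bar{y}}) = \frac{n_R - 1}{n - 1}\,\frac{s_R^2}{n} + \frac{n_M - 1}{n - 1}\,\frac{s_M^2}{\nu n} + \frac{1}{n - 1}\left[\frac{n_R}{n}\,(\bar{y}_R - \hat{\bar{y}})^2 + \frac{n_M}{n}\,(\bar{y}_M - \hat{\bar{y}})^2\right].$$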

If everyone responds in the subsample, two-phase sampling not only removes 
the nonresponse bias but also accounts for the original nonresponse in the estimated 
variance. 
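
The estimators (8.1) and (8.2) and the variance estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_phase_nonresponse`, its argument names, and the data values are hypothetical; every subsampled nonrespondent is assumed to respond; finite population corrections are ignored; and $s_R^2$ and $s_M^2$ are taken to be the usual sample variances within the two groups.

```python
from statistics import mean, variance


def two_phase_nonresponse(y_resp, y_sub, n_nonresp, N):
    """Two-phase estimates of the mean and total, with an estimated variance.

    y_resp    : values from the n_R original respondents
    y_sub     : values from the subsampled nonrespondents (all assumed to respond)
    n_nonresp : n_M, the number of initial nonrespondents in the sample
    N         : population size
    """
    n_R, n_M = len(y_resp), n_nonresp
    n = n_R + n_M
    nu = len(y_sub) / n_M  # subsampling fraction (fixed in advance in practice)
    ybar_R, ybar_M = mean(y_resp), mean(y_sub)

    ybar = (n_R / n) * ybar_R + (n_M / n) * ybar_M  # equation (8.1)
    t_hat = N * ybar                                # equation (8.2)

    s2_R, s2_M = variance(y_resp), variance(y_sub)  # sample variances
    v_hat = ((n_R - 1) / (n - 1) * s2_R / n
             + (n_M - 1) / (n - 1) * s2_M / (nu * n)
             + ((n_R / n) * (ybar_R - ybar) ** 2
                + (n_M / n) * (ybar_M - ybar) ** 2) / (n - 1))
    return ybar, t_hat, v_hat


# Hypothetical data: 6 respondents, 4 initial nonrespondents, 2 of them subsampled.
y_resp = [4, 6, 5, 7, 3, 5]
y_sub = [8, 10]
ybar, t_hat, v_hat = two_phase_nonresponse(y_resp, y_sub, n_nonresp=4, N=1000)

# The total is also a weighted sum: weight N/n for respondents, N/(n*nu) otherwise.
t_weighted = (1000 / 10) * sum(y_resp) + (1000 / (10 * 0.5)) * sum(y_sub)
print(ybar, t_hat, t_weighted, v_hat)
```

The weighted-sum form makes the role of the subsampling fraction visible: each subsampled nonrespondent here carries twice the weight of an original respondent.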


8.4 Mechanisms for Nonresponse


Most surveys have some residual nonresponse even after careful design and 
follow-up of nonrespondents. All methods for fixing up nonresponse are necessarily 
model-based. If we are to make any inferences about the nonrespondents, we must 
assume that they are related to respondents in some way. 

Dividing population members into two fixed strata of would-be respondents and 
would-be nonrespondents, as in Section 8.1, is fine for thinking about potential non- 
response bias and for two-phase methods. To adjust for nonresponse that remains 
after all other measures have been taken, we need a more elaborate setup. Define the 
random variable 


$$R_i = \begin{cases} 1 & \text{if unit } i \text{ responds} \\ 0 & \text{if unit } i \text{ does not respond.} \end{cases}$$


After sampling, the realizations of the response indicator variable are known for the
units selected in the sample. A value for $y_i$ is recorded if $r_i$, the realization of $R_i$, is 1.
The probability that a unit selected for the sample will respond,

$$\phi_i = P(R_i = 1),$$

is of course unknown but assumed positive. Rosenbaum and Rubin (1983) call $\phi_i$ the
propensity score for the $i$th unit.

Suppose that $y_i$ is a response of interest, and that $x_i$ is a vector of information
known about unit $i$ in the sample. Information used in the survey design is included
in $x_i$. We consider three types of missing data, using the Little and Rubin (2002)
terminology of nonresponse classification.


Missing Completely at Random

If $\phi_i$ does not depend on $x_i$, $y_i$, or the survey design, the
missing data are missing completely at random (MCAR). Such a situation occurs if,
for example, someone at the laboratory drops a test tube containing the blood sample
of one of the survey participants; there is no reason to think that the dropping of the
test tube had anything to do with the white blood cell count.² If data are MCAR, the
respondents are representative of the selected sample.

Missing data in the NCVS would be MCAR if the probability of nonresponse is 
completely unrelated to region of the United States, race, sex, age, or any other variable 
measured for the sample, and if the probability of nonresponse is unrelated to any 
variables about victimization status. Nonrespondents would be essentially selected at 
random from the sample. 

If the response probabilities $\phi_i$ are all equal and the events $\{R_i = 1\}$ are conditionally
independent of each other and of the sample selection process given $n_R$, then the
data are MCAR. If an SRS of size $n$ is taken, then under this mechanism the respondents
will be a simple random subsample of variable size $n_R$. The sample mean of
the respondents, $\bar{y}_R$, is approximately unbiased for the population mean. The MCAR
mechanism is implicitly adopted when nonresponse is ignored.


Missing at Random Given Covariates

If $\phi_i$ depends on $x_i$ but not on $y_i$, the data are
missing at random (MAR); the nonresponse depends only on observed variables.
We can successfully model the nonresponse, since we know the values of $x_i$ for all
sample units. Persons in the NCVS would be missing at random if the probability of 
responding to the survey depends on race, sex, and age—all known quantities—but 
does not vary with victimization experience within each age/race/sex class. This is 
sometimes termed ignorable nonresponse: Ignorable means that a model can explain 
the nonresponse mechanism and that the nonresponse can be ignored after the model 
accounts for it, not that the nonresponse can be completely ignored and complete-data 
methods used. 


Not Missing at Random If the probability of nonresponse depends on the value of 
a missing response variable, and cannot be completely explained by values in the 
observed data, then the nonresponse is not missing at random (NMAR). This is likely 
the situation for the NCVS: It is suspected that a person who has been victimized by 
crime is less likely to respond to the survey than a nonvictim, even if they share the 
values of all known variables such as race, age, and sex. Crime victims may be more 
likely to move after a victimization, and thus not be included in subsequent NCVS 
interviews. Models can help in this situation, because the nonresponse probability may 
also depend on known variables, but cannot completely adjust for the nonresponse. 
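
The practical difference among the three mechanisms can be illustrated with a small deterministic calculation. The sketch below rests on an assumption that is not stated in the text: for a large SRS with independent responses, the respondent mean is approximately $\sum \phi_i y_i / \sum \phi_i$ taken over the population. The population, the propensities, and the function names are all hypothetical.

```python
def respondent_mean_limit(y, phi):
    """Approximate large-sample value of the respondent mean: sum(phi*y)/sum(phi)."""
    return sum(p * v for p, v in zip(phi, y)) / sum(phi)


def class_adjusted_limit(y, phi, x):
    """Same approximation after reweighting respondents within classes of x."""
    N = len(y)
    total = 0.0
    for c in set(x):
        idx = [i for i in range(N) if x[i] == c]
        class_mean = respondent_mean_limit([y[i] for i in idx],
                                           [phi[i] for i in idx])
        total += len(idx) / N * class_mean
    return total


# Hypothetical population: x is an age class, y a 0/1 victimization indicator.
x = ["young"] * 4 + ["old"] * 4
y = [1, 1, 0, 0, 1, 0, 0, 0]
pop_mean = sum(y) / len(y)

mcar = [0.7] * 8                             # same propensity for everyone
mar = [0.5] * 4 + [0.9] * 4                  # propensity depends on x only
nmar = [0.3 if v == 1 else 0.9 for v in y]   # propensity depends on y itself

for name, phi in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(name, pop_mean,
          round(respondent_mean_limit(y, phi), 4),
          round(class_adjusted_limit(y, phi, x), 4))
```

In this toy population the unadjusted respondent mean matches the population mean only under MCAR; adjusting within classes of the known variable recovers it under MAR; under NMAR neither version does.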

The probabilities of responding, $\phi_i$, are useful for thinking about the type of
nonresponse. Unfortunately, they are unknown, so we do not know for sure which
type of nonresponse is present. We can sometimes distinguish between MCAR and
MAR by fitting a model attempting to predict the observed probabilities of response
for subgroups from known covariates; if the coefficients in a logistic regression model
predicting nonresponse are significantly different from 0, the missing data are likely
not MCAR. Distinguishing between MAR and NMAR is more difficult. In practice,
we expect most nonresponse in surveys to be of the NMAR type. It is unreasonable
to expect that we can construct a perfect model that will completely explain the
nonresponse mechanism. But we can try to reduce the bias due to nonresponse. In the
next section, we discuss a method that is commonly used to estimate the $\phi_i$’s.

² Even here, though, the suspicious mind can create a scenario in which the nonresponse might be related
to quantities of interest: Perhaps laboratory workers are less likely to drop test tubes that they believe
contain HIV.
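
As a rough illustration of checking MCAR against a known covariate, the sketch below compares response rates across classes with a Pearson chi-square statistic. This is a simpler stand-in for the logistic regression mentioned above, not the same method, and it ignores sampling weights and the survey design. The counts and the function name are hypothetical; 9.488 is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom.

```python
def response_rate_chi2(sampled, responded):
    """Pearson chi-square statistic for equal response rates across classes."""
    n, r = sum(sampled), sum(responded)
    overall = r / n  # pooled response rate under the MCAR hypothesis
    stat = 0.0
    for n_c, r_c in zip(sampled, responded):
        exp_resp = n_c * overall
        exp_non = n_c * (1 - overall)
        stat += (r_c - exp_resp) ** 2 / exp_resp
        stat += ((n_c - r_c) - exp_non) ** 2 / exp_non
    return stat


# Hypothetical counts for five age classes.
sampled = [200, 220, 180, 200, 200]
responded = [124, 187, 162, 190, 197]

stat = response_rate_chi2(sampled, responded)
critical = 9.488  # chi-square 0.95 quantile, 5 - 1 = 4 degrees of freedom
print(round(stat, 1), stat > critical)
```

A statistic well above the critical value suggests that response rates differ by class, so the data are unlikely to be MCAR; a small statistic says nothing about MAR versus NMAR.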


8.5 Weighting Methods for Nonresponse


In previous chapters we have seen how weights can be used in calculating estimates
for various sampling schemes (see Sections 2.4, 3.3, 5.3, and 7.2). The sampling
weights are the reciprocals of the inclusion probabilities, so that an estimator of the
population total is $\sum_{i \in S} w_i y_i$. For stratification, the weights are $w_i = N_h/n_h$ if unit $i$
is in stratum $h$; for sampling units with unequal probabilities, $w_i = 1/\pi_i$.

Weights can also be used to adjust for nonresponse. Let $Z_i$ be the indicator variable
for presence in the selected sample, with $P(Z_i = 1) = \pi_i$. If $R_i$ is independent of $Z_i$,
then the probability that unit $i$ will be measured is

$$P(\text{unit } i \text{ selected in sample and responds}) = \pi_i \phi_i.$$

The probability of responding, $\phi_i$, is estimated for each unit in the sample, using
auxiliary information that is known for all units in the selected sample. The final
weight for a respondent is then $1/(\pi_i \hat{\phi}_i)$. Weighting methods assume that the response
probabilities can be estimated from variables known for all units; they assume MAR
data.


8.5.1 Weighting Class Adjustment 


Sampling weights $w_i$ have been interpreted as the number of units in the population
represented by unit $i$ of the sample. Weighting class methods extend this approach
to compensate for nonsampling errors: Variables known for all units in the selected
sample are used to form weighting adjustment classes, and it is hoped that respondents
and nonrespondents in the same weighting adjustment class are similar. Weights of
respondents in the weighting adjustment class are increased, so that the respondents
represent the nonrespondents’ share of the population as well as their own.


EXAMPLE 8.4
Suppose the age is known for every member of the selected sample and that person
$i$ in the selected sample has sampling weight $w_i = 1/\pi_i$. Then weighting classes can
be formed by dividing the selected sample among different age classes, as Table 8.2
shows.

We estimate the response probability for each class by

$$\hat{\phi}_c = \frac{\text{sum of weights for respondents in class } c}{\text{sum of weights for selected sample in class } c}.$$

Then the sampling weight for each respondent in class $c$ is multiplied by $1/\hat{\phi}_c$, the
weight factor in Table 8.2. The weight of each respondent with age between 15 and
24, for example, is multiplied by 1.622. Since there was no nonresponse in the over-65
group, their weights are unchanged. ■
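
The adjustment factors in Table 8.2 can be recomputed from the two rows of weight sums. The following Python sketch uses the figures from the table; the variable names are illustrative only.

```python
classes = ["15-24", "25-34", "35-44", "45-64", "65+"]
sum_w_sample = [30322, 33013, 27046, 29272, 30451]  # all sampled persons
sum_w_resp = [18693, 28143, 24371, 28138, 30451]    # respondents only

# Estimated response probability and weight factor for each class.
phi_hat = [r / s for r, s in zip(sum_w_resp, sum_w_sample)]
factor = [1 / p for p in phi_hat]

for c, p, f in zip(classes, phi_hat, factor):
    print(f"{c:>6}  phi_hat = {p:.4f}  weight factor = {f:.3f}")

# After adjustment, respondent weights in each class sum to the class's
# sample weight total, so the overall total is preserved.
adjusted_total = sum(r * f for r, f in zip(sum_w_resp, factor))
```

Because each factor is the ratio of the two sums, the adjusted respondent weights add back up to the sum of weights for the whole selected sample.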


TABLE 8.2
Illustration of Weighting Class Adjustment Factors

                                                     Age
                                   15-24    25-34    35-44    45-64      65+     Total
Sample size                          202      220      180      195      203      1000
Respondents                          124      187      162      187      203       863
Sum of weights for sample         30,322   33,013   27,046   29,272   30,451   150,104
Sum of weights for respondents    18,693   28,143   24,371   28,138   30,451
$\hat{\phi}_c$                    0.6165   0.8525   0.9011   0.9613   1.0000
Weight factor                      1.622    1.173    1.110    1.040    1.000


The probability of response is assumed to be the same within each weighting
class, with the implication that within a weighting class, the probability of response
does not depend on $y$. As mentioned earlier, weighting class methods assume MAR
data. The weight for a respondent in weighting class $c$ is $1/(\pi_i \hat{\phi}_c)$.

To estimate the population total using weighting class adjustments, let $x_{ci} = 1$ if
unit $i$ is in class $c$, and 0 otherwise. Then let the new weight for respondent $i$ be

$$\tilde{w}_i = w_i \sum_c \frac{x_{ci}}{\hat{\phi}_c},$$

where $w_i$ is the sampling weight for unit $i$; $\tilde{w}_i = w_i/\hat{\phi}_c$ if unit $i$ is in class $c$. Assign
$\tilde{w}_i = 0$ if unit $i$ is a nonrespondent. Then,

$$\hat{t}_{wc} = \sum_{i \in S} \tilde{w}_i y_i \quad \text{and} \quad \bar{y}_{wc} = \frac{\hat{t}_{wc}}{\sum_{i \in S} \tilde{w}_i}.$$

In an SRS, for example, if $n_c$ is the number of sample units in class $c$, $n_{cR}$ is the
number of respondents in class $c$, and $\bar{y}_{cR}$ is the average for the respondents in class
$c$, then $\hat{\phi}_c = n_{cR}/n_c$ and

$$\hat{t}_{wc} = N \sum_c \frac{n_c}{n}\,\bar{y}_{cR} \quad \text{and} \quad \bar{y}_{wc} = \sum_c \frac{n_c}{n}\,\bar{y}_{cR}.$$


EXAMPLE 8.5
The National Crime Victimization Survey. To adjust for individual nonresponse in
the NCVS, the within-household noninterview adjustment factor (WHHNAF) of
Chapter 7 is used. NCVS interviewers gather demographic information on the
nonrespondents, and this information is used to classify all persons into 24 weighting
adjustment cells. The cells depend on the age of the person, the relation of the person
to the reference person (head of household), and the race of the reference person.
For any cell, let $W_R$ be the sum of the weights for the respondents, and $W_M$ be
the sum of the weights for the nonrespondents. Then the new weight for a respondent
in a cell will be the previous weight multiplied by the weighting adjustment factor
$(W_M + W_R)/W_R$. Thus the weights that would be assigned to nonrespondents are
reallocated among respondents with similar (we hope) characteristics.

A problem occurs if $(W_M + W_R)/W_R$ is too large. If $(W_M + W_R)/W_R > 2$, the cell
contains more nonrespondents than respondents (in weighted terms). In this case, the variance of the
estimate increases; if the number of respondents in the cell is small, the weight may
not be stable. The Census Bureau collapses cells to obtain weighting adjustment
factors of 2 or less. If there are fewer than 30 interviewed persons in a cell, or if
the weighting adjustment factor is greater than 2, the cell is combined (collapsed)
with neighboring cells until the collapsed cell has more than 30 observations and a
weighting adjustment factor of 2 or less. ■
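
The collapsing rule can be sketched as a loop over an ordered list of cells. This is a simplified illustration, not the Census Bureau's procedure: the cell data are hypothetical, "neighboring" is taken to mean the next cell in the list (or the previous one for the last cell), and a collapsed cell is accepted once it has at least 30 respondents and a factor of 2 or less.

```python
def adjustment_factor(cell):
    """Weighting adjustment factor (W_M + W_R) / W_R; infinite with no respondents."""
    if cell["W_R"] == 0:
        return float("inf")
    return (cell["W_M"] + cell["W_R"]) / cell["W_R"]


def needs_collapse(cell, min_resp=30, max_factor=2.0):
    return cell["n_resp"] < min_resp or adjustment_factor(cell) > max_factor


def combine(a, b):
    """Pool two cells, remembering which original cells they contain."""
    return {
        "members": a["members"] + b["members"],
        "n_resp": a["n_resp"] + b["n_resp"],
        "W_R": a["W_R"] + b["W_R"],
        "W_M": a["W_M"] + b["W_M"],
    }


def collapse_cells(cells, min_resp=30, max_factor=2.0):
    merged = [dict(c, members=[k]) for k, c in enumerate(cells)]
    i = 0
    while i < len(merged) and len(merged) > 1:
        if needs_collapse(merged[i], min_resp, max_factor):
            # Assumed rule: merge with the next cell, or the previous one at the end.
            j = i + 1 if i + 1 < len(merged) else i - 1
            lo, hi = min(i, j), max(i, j)
            merged[lo:hi + 1] = [combine(merged[lo], merged[hi])]
            i = lo  # re-check the newly combined cell
        else:
            i += 1
    return merged


# Hypothetical cells: respondent count and weight sums W_R, W_M.
cells = [
    {"n_resp": 45, "W_R": 9000, "W_M": 3000},
    {"n_resp": 12, "W_R": 2400, "W_M": 3600},  # too few respondents, factor 2.5
    {"n_resp": 40, "W_R": 8000, "W_M": 2000},
    {"n_resp": 60, "W_R": 12000, "W_M": 1000},
]
collapsed = collapse_cells(cells)
for c in collapsed:
    print(c["members"], c["n_resp"], round(adjustment_factor(c), 3))
```

In practice the choice of which cells to pool should reflect similarity in the variables of interest, not just list order.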


Construction of Weighting Classes

Weighting adjustment classes should be constructed
as though they were strata; as shown in the next section, weighting adjustment
is similar to poststratification. If we could construct weighting classes so that in each
weighting class $c$ (a) the response variable $y_i$ is constant in class $c$, (b) the response
propensity $\phi_i$ is the same for every unit in class $c$, or (c) the response $y_i$ is uncorrelated
with the response propensity $\phi_i$ in class $c$, then we would largely eliminate
nonresponse bias for estimating population means and totals (see Exercise 17).

Consequently, the weighting classes should be formed so that units within each 
class are as similar as possible with respect to the major variables of interest, and so 
that the response propensities vary from class to class but are relatively homogeneous 
within a class. At the same time, it is desirable to avoid very large weight adjustments. 
Eltinge and Yansaneh (1997) discuss methods for choosing the number of weighting 
classes to use. 


8.5.2 Poststratification


Poststratification was introduced in Section 4.4; it is a form of ratio adjustment. To 
use poststratification to try to compensate for nonresponse, we modify the weights so 
that the sample is calibrated to population counts in the poststrata. Poststratification 
is similar to weighting class adjustment, except that population counts are used to 
adjust the weights. Suppose an SRS is taken. After the sample is collected, units are 
grouped into H different poststrata, usually based on demographic variables such as 
race or sex. The population has $N_h$ units in poststratum $h$; of these, $n_h$ were selected
for the sample and $n_{hR}$ responded. The poststratified estimator for $\bar{y}_U$ is


$$\bar{y}_{post} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_{hR};$$

the weighting class estimator for $\bar{y}_U$ is

$$\bar{y}_{wc} = \sum_{h=1}^{H} \frac{n_h}{n}\,\bar{y}_{hR}.$$

The two estimators are similar in form; the only difference is that in poststratification,
the $N_h$ are known while in weighting class adjustments the $N_h$ are unknown
and estimated by $N n_h/n$. A variance estimator for poststratification will be given in
Exercise 17 of Chapter 9.


Poststratification Using Weights 


In a general survey design, the sum of the weights in a subgroup, $\sum_{i \in S_h} w_i$, is supposed
to estimate the population count for that subgroup, $N_h$. Poststratification uses the ratio
estimator within each subgroup to adjust by the true population count.

Let $x_{hi} = 1$ if unit $i$ is a respondent in poststratum $h$, and 0 otherwise. Then let

$$w_i^* = w_i \sum_{h=1}^{H} x_{hi}\,\frac{N_h}{\sum_{j \in \mathcal{R}} w_j x_{hj}},$$

where $\mathcal{R}$ is the set of respondents in the sample. Using the modified weights,

$$\sum_{i \in \mathcal{R}} w_i^* x_{hi} = N_h,$$

and the poststratified estimator of the population total is

$$\hat{t}_{post} = \sum_{i \in \mathcal{R}} w_i^* y_i.$$

Note that the modified weights $w_i^*$ depend on the particular sample selected.

Poststratification adjusts for undercoverage as well as nonresponse if the population
count $N_h$ includes individuals not in the sampling frame for the survey. As
shown in Chapter 4, poststratification can reduce the variance of estimated population
quantities by calibrating the survey to the known population counts.
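
A minimal Python sketch of the modified weights follows. The function name `poststratify` and the respondent data are hypothetical; the only inputs are the respondents' current weights, their poststratum labels, and the known population counts $N_h$.

```python
def poststratify(weights, strata, pop_counts):
    """Return w*_i = w_i * N_h / (sum of respondent weights in poststratum h)."""
    totals = {}
    for w, h in zip(weights, strata):
        totals[h] = totals.get(h, 0.0) + w
    return [w * pop_counts[h] / totals[h] for w, h in zip(weights, strata)]


# Hypothetical respondents: current weights, poststratum, and response y.
weights = [100, 100, 120, 80, 150, 150]
strata = ["F", "F", "F", "M", "M", "M"]
y = [3, 5, 4, 6, 2, 7]
pop_counts = {"F": 400, "M": 350}  # known population counts N_h

w_star = poststratify(weights, strata, pop_counts)
t_post = sum(w * v for w, v in zip(w_star, y))

# The modified weights reproduce the population count in every poststratum.
check = {h: sum(w for w, s in zip(w_star, strata) if s == h) for h in pop_counts}
print(check, round(t_post, 3))
```

In this example the weights in poststratum F are multiplied by 400/320 = 1.25, while those in M are multiplied by 350/380, a factor below one, which cannot happen with a weighting class adjustment.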


The second-stage factor in the NCVS (see Section 7.6) uses poststratification to adjust 
the weights. After all other weighting adjustments have been done, including the 
weighting class adjustments for nonresponse, poststratification is used to make the 
sample counts agree with estimates of the population counts from the Census Bureau. 
Each person is assigned to one of 72 poststrata based on the person’s age, race, and 
sex. The number of persons in the population falling in that poststratum, $N_h$, is known
from other sources. Then, the weight for a person in poststratum $h$ is multiplied by

$$\frac{N_h}{\text{sum of weights for all respondents in poststratum } h}.$$

With weighting classes, the weighting factor to adjust for unit nonresponse is always
at least one. With poststratification, because weights are adjusted so that they sum to
a known population total, the weighting factor can be any positive number, although
weighting factors of two or less are desirable. ■


The poststratified estimator is approximately unbiased if within each poststratum
$h$, (a) each unit has the same probability of responding, (b) the response propensity
$\phi_i$ is the same for every unit, or (c) the response $y_i$ is uncorrelated with the response
propensity $\phi_i$ (see Exercise 18). These are big assumptions: To make them seem
a little more plausible, survey researchers often use many poststrata. But a large 
number of poststrata may create additional problems, because poststrata with too few 
respondents may result in unstable estimates (Gelman and Carlin, 2002). If faced 
with poststrata with few observations, most practitioners collapse the poststrata with 
others that have similar means in key variables until they have a reasonable number 
of observations in each poststratum. For the CPS, a “reasonable” number means that 
each group has at least 20 observations and that the response rate for each group is at 
least 50%. 


Raking Adjustments 


Raking is a poststratification method that may be used when poststrata are formed 
using more than one variable, but only the marginal population totals are known. 
Raking was first used in the 1940 census to make sure that the complete census and 
samples taken from it gave consistent results and was introduced by Deming and 
Stephan (1940); Brackstone and Rao (1979) further developed the theory. 

Consider the following table of sums of weights from a sample; each entry in 
the table is the sum of the sampling weights for persons in the sample falling in 
that classification (for example, the sum of the sampling weights for black females 
is 300). 


                                                  Native              Sum of
                     Black     White     Asian  American     Other   Weights
Female                 300      1200        60        30        30      1620
Male                   150      1080        90        30        30      1380
Sum of Weights         450      2280       150        60        60      3000


Now suppose we know the true population counts for the marginal totals: We know 
that the population has 1510 women and 1490 men, 600 blacks, 2120 whites, 150 
Asians, 100 Native Americans, and 30 persons in the “Other” category. The population 
counts for each cell in the table, however, are unknown; we do not know the number of 
black females in this population and cannot assume independence. Raking allows us 
to adjust the weights so that the sums of weights in the margins equal the population 
counts. 

First, adjust the rows. Multiply each entry by (true row population)/(estimated 
row population). Multiplying the cells in the female row by 1510/1620 and the cells 
in the male row by 1490/1380 results in the following table: 


                                                  Native              Sum of
                     Black     White     Asian  American     Other   Weights
Female              279.63   1118.52     55.93     27.96     27.96      1510
Male                161.96   1166.09     97.17     32.39     32.39      1490
Total               441.59   2284.61    153.10     60.35     60.35      3000

The row totals are fine now, but the column totals do not yet equal the population
totals. Repeat the same procedure with the columns in the new table. The entries in 
the first column are each multiplied by 600/441.59. The following table results: 


                                                  Native              Sum of
                     Black     White     Asian  American     Other   Weights
Female              379.94   1037.93     54.79     46.33     13.90   1532.90
Male                220.06   1082.07     95.21     53.67     16.10   1467.10
Total               600.00   2120.00    150.00    100.00     30.00      3000


But this has thrown the row totals off again. Repeat the procedure until both row 
and column totals equal the population counts. The procedure converges as long as 
all cell counts are positive. In this example, the final table of adjusted counts is: 


                                                  Native              Sum of
                     Black     White     Asian  American     Other   Weights
Female              375.59   1021.47     53.72     45.56     13.67      1510
Male                224.41   1098.53     96.28     54.44     16.33      1490
Total               600.00   2120.00    150.00    100.00     30.00      3000


The entries in the last table may be better estimates of the cell populations (i.e., 
with smaller variance) than the original weighted estimates, simply because they use 
more information about the population. The weighting adjustment factor for each 
white male in the sample is 1098.53/1080; the weight of each white male is increased 
a little to adjust for nonresponse and undercoverage. Likewise, the weights of white 
females are decreased because they are overrepresented in the sample. 

The assumptions for raking are the same as for poststratification, with the addi- 
tional assumption that the response probabilities depend only on the row and col- 
umn and not on the particular cell. Raking has some difficulties—the algorithm 
may not converge if some of the cell estimates are zero. There is also a danger 
of “overadjustment”: if there is little relation between the extra dimension in raking
and the cell means, raking can increase the variance rather than decrease it. 
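
The row-then-column adjustment described above is often called iterative proportional fitting, and it is short to write in Python. The sketch below uses the sums of weights and marginal population counts from the example; the function name `rake`, the tolerance, and the iteration cap are assumptions made for illustration.

```python
def rake(table, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Alternately scale rows and columns until both margins match the targets."""
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Row step: multiply by (true row population) / (estimated row population).
        for i, target in enumerate(row_targets):
            s = sum(cells[i])
            cells[i] = [v * target / s for v in cells[i]]
        # Column step: the same adjustment applied to each column.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / s
        # Columns are exact after the column step; stop when rows also agree.
        if all(abs(sum(row) - t) < tol for row, t in zip(cells, row_targets)):
            break
    return cells


# Sums of sampling weights: rows Female, Male; columns Black, White, Asian,
# Native American, Other.
table = [[300, 1200, 60, 30, 30],
         [150, 1080, 90, 30, 30]]
row_targets = [1510, 1490]
col_targets = [600, 2120, 150, 100, 30]

raked = rake(table, row_targets, col_targets)
for label, row in zip(["Female", "Male"], raked):
    print(label, [round(v, 2) for v in row])

# Weighting adjustment factor for each cell: raked sum / original sum.
factors = [[r / t for r, t in zip(rr, tr)] for rr, tr in zip(raked, table)]
```

Each respondent's weight is then multiplied by the factor for his or her cell. A zero cell would stay zero under this scheme, which is one way to see why convergence can fail when some cell estimates are zero.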


8.5.3 Advantages and Disadvantages of Weighting Adjustments


Weighting class adjustments and poststratification can both help reduce nonresponse 
bias. The models for weighting adjustments for nonresponse are strong: In each 
weighting cell or poststratum, the respondents and nonrespondents are assumed to be 
similar, or each individual in a weighting class is assumed equally likely to respond 
to the survey or have a response propensity that is uncorrelated with y. These models 
never exactly describe the true state of affairs, and you should always consider their 
plausibility and implications. It is an unfortunate tendency of some survey practition- 
ers to treat the weighting adjustment as a complete remedy and then act as though 
there was no nonresponse. Weights may improve many of the estimates, but they rarely 
eliminate all nonresponse bias. If weighting adjustments are made (and remember, 
making no adjustments is itself a model about the nature of the nonresponse), practitioners
should always state the assumed response model and give evidence to justify
it. Weighting adjustments are usually used for unit nonresponse, not for item nonre- 
sponse (which would require a different weight for each item). 

Poststratification is a special case of calibration methods in survey sampling. 
Deville and Särndal (1992) and Särndal (2007) describe the use of calibration methods
to attempt to reduce nonresponse bias. We discuss general calibration methods in 
Section 11.7. 


Missing items may occur in surveys for several reasons: An interviewer may fail to 
ask a question; a respondent may refuse to answer the question or cannot provide 
the information; a clerk entering the data may skip the value. Sometimes, items with 
responses are changed to missing when the data set is edited or cleaned—a data editor 
may not be able to resolve the discrepancies for an individual 3-year-old who voted 
in the last election, and may set both values to missing. 

Imputation is commonly used to assign values to the missing items. A replace- 
ment value, often from another person in the survey who is similar to the item non- 
respondent on other variables, is imputed (filled in) for the missing value. When 
imputation is used, an additional variable should be created for the data set that indi- 
cates whether the response was measured or imputed. 

Imputation procedures are used not only to reduce the nonresponse bias but to 
produce a “clean,” rectangular data set—one without holes for the missing values. We 
may want to look at tables for subgroups of the population, and imputation allows us 
to do that without considering the item nonresponse separately each time we construct 
a table. 


The CPS has an overall high household response rate, but some households refuse 
to answer certain questions. The nonresponse rate is about 20% on many income 
questions. This nonresponse would create a substantial bias in any analysis unless 
some corrective action were taken: Various studies suggest that the item nonresponse 
for the income items is highest for low-income and high-income households. Imputa- 
tion for the missing data makes it possible to use standard statistical techniques such 
as regression without the analyst having to treat the nonresponse by using specially 
developed methods. For surveys such as the CPS, if imputation is to be done, the 
agency collecting the data has more information to guide it in filling in the miss- 
ing values than does an independent analyst, because identifying information is not 
released on the public use tapes. 

The CPS uses weighting for noninterview adjustment, and deductive and hot-deck 
imputation for item nonresponse. The sample is divided into classes using variables 
sex, age, race, and other demographic characteristics. If an item is missing, a corre- 
sponding item from another unit in that class is substituted. The classifications differ 
for different items; for imputing weekly earnings, several thousand classes are formed 
from demographic characteristics as well as occupation and education (U.S. Census 
Bureau, 2006a). ■
TABLE 8.3
Small Data Set Used to Illustrate Imputation Methods

                           Years of      Crime    Violent Crime
Person    Age    Sex      Education    Victim?          Victim?
     1     47      M             16          0                0
     2     45      F              ?          1                1
     3     19      M             11          0                0
     4     21      F              ?          1                1
     5     24      M             12          1                1
     6     41      F              ?          0                0
     7     36      M             20          1                ?
     8     50      M             12          0                0
     9     53      F             13          0                ?
    10     17      M             10          ?                ?
    11     53      F             12          0                0
    12     21      F             12          0                0
    13     18      F             11          1                ?
    14     34      M             16          1                0
    15     44      M             14          0                0
    16     45      M             11          0                0
    17     54      F             14          0                0
    18     55      F             10          0                0
    19     29      F             12          ?                0
    20     32      F             10          0                0


We use the small data set in Table 8.3 to illustrate some of the different methods 
for imputation. This artificial data set is only used for illustration; in practice, a much 
larger data set is needed for imputation. A “1” means the respondent answered yes to 
the question. 


8.6.1 Deductive Imputation 


Some values may be imputed in the data editing, using logical relations among the 
variables. Person 9 is missing the response for whether she was a victim of violent 
crime. But she had responded that she was not a victim of any crime, so the violent 
crime response should be changed to 0. 

Deductive imputation may sometimes be used in longitudinal surveys. If a woman 
has two children in year 1 and two children in year 3, but is missing the value for year
2, the logical value to impute would be two. 
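The logical rule used for person 9 can be written as a short Python sketch. The record layout and the function name `deduce_violent` are assumptions made for illustration, and `None` stands for a missing response.

```python
# Deductive imputation sketch: fill a value only when logic forces it.
# Records mirror persons 7, 9, and 10 of Table 8.3; None marks a missing item.
records = {
    7: {"crime": 1, "violent": None},
    9: {"crime": 0, "violent": None},
    10: {"crime": None, "violent": None},
}

def deduce_violent(rec):
    """Return a copy with violent crime set to 0 when no crime was reported."""
    out = dict(rec)
    if out["violent"] is None and out["crime"] == 0:
        # Not a victim of any crime implies not a victim of violent crime.
        out["violent"] = 0
    return out

imputed = {pid: deduce_violent(rec) for pid, rec in records.items()}
```

Only person 9 is changed; persons 7 and 10 stay missing because nothing in their other answers determines the value.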


8.6.2 Cell Mean Imputation


Respondents are divided into classes (cells) based on known variables, as in weighting 
class adjustments. Then the average of the values for the responding units in cell c,
$\bar{y}_{Rc}$, is substituted for each missing value. Cell mean imputation assumes that missing
items are missing completely at random within the cells. 


The four cells for our example are constructed using the variables age and sex. (In 
practice, of course, you would want to have many more individuals in each cell.) 


                            Age
           ≤ 34                        ≥ 35
Sex  M     Persons 3, 5, 10, 14        Persons 1, 7, 8, 15, 16
     F     Persons 4, 12, 13, 19, 20   Persons 2, 6, 9, 11, 17, 18


Persons 2 and 6, missing the value for years of education, would be assigned the mean 
value for the four women aged 35 or older who responded to the question: 12.25. The 
mean for each cell after imputation is the same as the mean of the respondents. The 
imputed value, however, is not one of the possible responses to the question about 
education.
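As a check on the arithmetic in this example, the following Python sketch computes the cell mean for the women aged 35 or older and fills it in for persons 2 and 6. The tuple layout is an assumption for illustration, with `None` marking a missing value; the sketch also compares the sample variance before and after imputation.

```python
from statistics import mean, variance

# (person, age, sex, years of education) for the women aged 35 or older
# in Table 8.3; None marks a missing value.
cell = [
    (2, 45, "F", None), (6, 41, "F", None), (9, 53, "F", 13),
    (11, 53, "F", 12), (17, 54, "F", 14), (18, 55, "F", 10),
]

observed = [educ for (_, _, _, educ) in cell if educ is not None]
cell_mean = mean(observed)          # mean of the responding units in the cell

# Substitute the cell mean for every missing value.
filled = [cell_mean if educ is None else educ for (_, _, _, educ) in cell]

mean_after = mean(filled)
var_before = variance(observed)     # sample variance of the four respondents
var_after = variance(filled)        # sample variance after imputation
```

The cell mean stays at 12.25 after imputation, while the sample variance computed from the six filled-in values is smaller than the variance of the four observed values.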


Mean imputation gives the same point estimates for means, totals, and propor- 
tions as the weighting class adjustments. Mean imputation methods fail to reflect the 
variability of the nonrespondents, however—all missing observations in a class are 
given the same imputed value. The distribution of y will be distorted because of a 
“spike” at the value of the sample mean of the respondents. As a consequence, the 
estimated variance in the subclass will be too small. 

To avoid the spike, a stochastic cell mean imputation could be used. If the 
response variable were approximately normally distributed, the missing values could 
be imputed with a randomly generated value from a normal distribution with mean 
$\bar{y}_{Rc}$ and standard deviation $s_{Rc}$.
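A minimal Python sketch of stochastic cell mean imputation for the education cell of women aged 35 or older follows. It assumes, purely for illustration, that a normal model is adequate for years of education; the seed is hypothetical and is there only to make the draws reproducible.

```python
import random
from statistics import mean, stdev

observed = [13, 12, 14, 10]      # years of education for the four respondents in the cell
y_bar = mean(observed)           # respondent mean in the cell
s = stdev(observed)              # respondent standard deviation in the cell

rng = random.Random(2024)        # hypothetical seed, for reproducibility only
# One independent normal draw for each missing value (persons 2 and 6).
draws = {pid: rng.gauss(y_bar, s) for pid in (2, 6)}
```

Unlike plain cell mean imputation, the two persons now receive different imputed values, so no spike appears at 12.25. The draws are still not guaranteed to be possible responses to the education question.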

Mean imputation, stochastic or otherwise, distorts relationships among different 
variables, because imputation is done separately for each missing item. Sample corre- 
lations and other statistics are changed. Jinn and Sedransk (1989) discuss the effect of 
different imputation methods on secondary data analysis, for instance for estimating 
a regression slope. 


8.6.3 Hot-Deck Imputation 


In hot-deck imputation, as in cell mean imputation and weighting adjustment methods, 
the sample units are divided into classes. The value of one of the responding units in 
the class is substituted for each missing response. Often, the values for a set of related 
missing items are taken from the same donor, to preserve some of the multivariate 
relationships. The name hot deck is from the days when computer programs and data 
sets were punched on cards—the deck of cards containing the data set being analyzed 
was warmed by the card reader, so the term hot deck was used to refer to imputations 
made using the same data set. Fellegi and Holt (1976) discuss methods for data editing 
and hot-deck imputation with large surveys. 
How is the donor unit to be chosen? Several methods are possible.


Sequential Hot-Deck Imputation. Some hot-deck imputation procedures impute the
value in the same subgroup that was last read by the computer. This is partly a car- 
ryover from the card days of computers (imputation could be done in one pass), and 
partly a belief that if the data are arranged in some geographical order, adjacent units 
in the same subgroup will tend to be more similar than randomly chosen units in the 
subgroup. One problem with using the value on the previous “card” is that often the 
nonrespondents also tend to occur in clusters, so one person may be a donor multiple 
times, in a way that the sampler cannot control. One of the other hot-deck imputation 
methods is usually used today for most surveys. 

In our example, person 19 is missing the response for crime victimization. Person 
13 had the last response recorded in her subclass, so the value 1 is imputed. 
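The one-pass logic can be sketched in Python as follows, assuming the file is read in order of person number and that the cells are the four age-by-sex classes used for cell mean imputation. The names `cell_of`, `last_seen`, and `donor_of` are illustrative only.

```python
# (person, age, sex, crime_victim) from Table 8.3; None marks a missing response.
rows = [
    (1, 47, "M", 0), (2, 45, "F", 1), (3, 19, "M", 0), (4, 21, "F", 1),
    (5, 24, "M", 1), (6, 41, "F", 0), (7, 36, "M", 1), (8, 50, "M", 0),
    (9, 53, "F", 0), (10, 17, "M", None), (11, 53, "F", 0), (12, 21, "F", 0),
    (13, 18, "F", 1), (14, 34, "M", 1), (15, 44, "M", 0), (16, 45, "M", 0),
    (17, 54, "F", 0), (18, 55, "F", 0), (19, 29, "F", None), (20, 32, "F", 0),
]

def cell_of(age, sex):
    """Imputation class: sex crossed with age 34 or younger versus 35 or older."""
    return (sex, "<=34" if age <= 34 else ">=35")

last_seen = {}   # cell -> (donor id, value) most recently read in that cell
donor_of = {}    # recipient id -> donor id, so imputed values stay traceable
imputed = {}
for pid, age, sex, y in rows:
    cell = cell_of(age, sex)
    if y is None:
        if cell in last_seen:            # with no earlier donor the value stays missing
            donor, y = last_seen[cell]
            donor_of[pid] = donor
    else:
        last_seen[cell] = (pid, y)
    imputed[pid] = y
```

Person 19 receives the value 1 from person 13, as in the text. Keeping `donor_of` records which donor supplied each imputed value.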


Random Hot-Deck Imputation. A donor is randomly chosen from the persons in the
cell with information on all the missing items. To preserve multivariate relationships, 
usually values from the same donor are used for all missing items of a person. 

In our small data set, person 10 is missing both variables for victimization. Persons 
3, 5, and 14 in his cell have responses for both crime questions, so one of the three 
is chosen randomly as the donor. In this case, person 14 is chosen, and his values are 
imputed for both missing variables. 
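A small Python sketch of the random donor selection for person 10's cell is given below. The function name `random_hot_deck` and the seed are hypothetical; a different seed may select a different donor than the person 14 chosen in the text.

```python
import random

# (person, crime_victim, violent_crime) for the men aged 34 or younger
# in Table 8.3; None marks a missing response.
cell = [(3, 0, 0), (5, 1, 1), (10, None, None), (14, 1, 0)]

def random_hot_deck(cell, rng):
    """Fill missing items from one randomly chosen complete case in the cell."""
    donors = [r for r in cell if r[1] is not None and r[2] is not None]
    out = []
    for pid, crime, violent in cell:
        donor_id = None
        if crime is None or violent is None:
            donor_id, d_crime, d_violent = rng.choice(donors)
            # Take every missing item from the same donor.
            crime = d_crime if crime is None else crime
            violent = d_violent if violent is None else violent
        out.append((pid, crime, violent, donor_id))
    return out

imputed = random_hot_deck(cell, random.Random(7))   # hypothetical seed
```

Each output record carries the donor's identifier in its last position, so imputed values can be distinguished from observed ones later.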


Nearest-Neighbor Hot-Deck Imputation. Define a distance measure between observations,
and impute the value of a respondent who is “closest” to the person with the
missing item, where closeness is defined using the distance function. 

If age and sex are used for the distance function, so that the person of closest age 
with the same sex is selected to be the donor, the victimization responses of person 3 
will be imputed for person 10. 
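The donor search can be sketched in Python as below, assuming the distance is the absolute age difference among complete cases of the same sex. The function name `nearest_donor` is illustrative; ties are broken here by file order, which is only one of several possible conventions.

```python
# (person, age, sex, crime_victim, violent_crime) for the men in Table 8.3;
# None marks a missing response.
men = [
    (1, 47, "M", 0, 0), (3, 19, "M", 0, 0), (5, 24, "M", 1, 1),
    (7, 36, "M", 1, None), (8, 50, "M", 0, 0), (10, 17, "M", None, None),
    (14, 34, "M", 1, 0), (15, 44, "M", 0, 0), (16, 45, "M", 0, 0),
]

def nearest_donor(recipient, candidates):
    """Closest complete case of the same sex, by absolute age difference."""
    _, age, sex, _, _ = recipient
    complete = [c for c in candidates
                if c[2] == sex and c[3] is not None and c[4] is not None]
    return min(complete, key=lambda c: abs(c[1] - age))

recipient = next(r for r in men if r[0] == 10)
donor = nearest_donor(recipient, men)
imputed_values = (donor[3], donor[4])   # both victimization items from one donor
```

For person 10 (age 17) the closest complete case is person 3 (age 19), so both victimization responses are imputed as 0.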


8.6.4 Regression Imputation


Regression imputation predicts the missing value using a regression of the item of 
interest on variables observed for all cases. A variation is stochastic regression imputa- 
tion, in which the missing value is replaced by the predicted value from the regression 
model plus a randomly generated error term. 

We only have 18 complete observations for the response crime victimization 
(not really enough for fitting a model to our data set), but a logistic regression of 
the response with explanatory variable age gives the following model for predicted 
probability of victimization, $\hat{p}$:


$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 2.5643 - 0.0896 \times \text{age}.$$


The predicted probability of being a crime victim for a 17-year-old is 0.74; because 
that is greater than a predetermined cutoff of 0.5, the value 1 is imputed for Person 10.
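The calculation for person 10 can be reproduced with a short Python sketch that uses the two coefficients reported above. The function names are illustrative, and the stochastic variant for a binary item (a Bernoulli draw with the predicted probability) is an assumption about how a random component might be added, not a procedure taken from the text.

```python
import math
import random

B0, B1 = 2.5643, -0.0896     # intercept and age slope reported in the text

def predicted_probability(age):
    """Inverse logit of the fitted linear predictor."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * age)))

def impute_deterministic(age, cutoff=0.5):
    """Impute 1 when the predicted probability exceeds the cutoff."""
    return 1 if predicted_probability(age) > cutoff else 0

def impute_stochastic(age, rng):
    """Assumed stochastic variant: draw a 0/1 value with the predicted probability."""
    return 1 if rng.random() < predicted_probability(age) else 0

p10 = predicted_probability(17)                      # person 10 is 17 years old
value10 = impute_deterministic(17)
draw10 = impute_stochastic(17, random.Random(1))     # hypothetical seed
```

With the deterministic rule, every nonrespondent of the same age receives the same value; the stochastic variant lets imputed values vary across such nonrespondents.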
Paulin and Ferraro (1994) discuss regression models for imputing income in the U.S.
Consumer Expenditure Survey. Households selected for the interview component of 
the survey are interviewed each quarter for five consecutive quarters; in each interview, 
they are asked to recall expenditures for the previous three months. The data are used 
to relate consumer expenditures to characteristics such as family size and income; 
they are the source of reports that expenditures exceed income in certain income 
classes. 

The Consumer Expenditure Survey conducts about 5,000 interviews each year, 
as opposed to about 60,000 for the CPS. This sample size is too small for hot-deck 
imputation methods, as it is less likely that suitable donors will be found for nonre- 
spondents in a smaller sample. If imputation is to be done at all, a parametric model 
needs to be adopted. Paulin and Ferraro used multiple regression models to predict 
the log of family income (logarithms are used because the distribution of income is 
skewed) from explanatory variables including total expenditures and demographic 
variables. These models assume that income items are MAR given the covariates.


8.6.5 Cold-Deck Imputation 
In cold-deck imputation, the imputed values are from a previous survey or other
information, such as from historical data. (Since the data set serving as the source 
for the imputation is not the one currently running through the computer, the deck 
is “cold.”) As with hot-deck imputation, cold-deck imputation is not guaranteed to 
eliminate selection bias. 


Kirkman et al. (2005) describe the imputation procedures used in the 2004 Annual 
Survey of the Mathematical Sciences, which reports on faculty composition and 
degrees awarded by departments of mathematical sciences in U.S. colleges and
universities. Departments are grouped into seven classes (p. 883): groups (I), (II), and (III)
are doctoral-granting departments of mathematics; group (IV) consists of
doctoral-granting departments of statistics, biostatistics, and biometrics; group (Va) consists
of applied mathematics doctoral-granting departments; group (M) consists of
departments whose highest graduate degree is a master’s degree; and group (B) consists of
departments granting only a baccalaureate degree. The questionnaire is sent to every
institution in groups (I), (II), (III), (IV), and (Va); it is sent to a stratified sample
of institutions in groups (M) and (B). The response rate in each group is generally
between 90% and 100%. Before 2001, population totals were estimated by using 
the data from the respondents with simple projections (essentially, the weights for 
the respondents were increased to compensate for the nonrespondents). Beginning in 
2001, the survey uses cold-deck imputation. If a doctoral department does not respond 
in the current year but has responded during the previous three years, the responses 
from the previous questionnaire are imputed for the current year’s data.


8.6.6 Multiple Imputation 


In multiple imputation, each missing value is imputed m (≥ 2) different times.
Typically, the same stochastic model is used for each imputation. These create m 
different “data” sets with no missing values. Each of the m data sets is analyzed as if
no imputation had been done; the different results give the analyst a measure of the 
additional variance due to the imputation. Multiple imputation with different models 
for nonresponse can give an idea of the sensitivity of the results to particular nonre- 
sponse models. See Rubin (1987, 1996, 2004) for details on implementing multiple 
imputation. Schenker et al. (2006) describe methods used for multiple imputation of 
income items in the U.S. National Health Interview Survey. 
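One common way to combine the m sets of results is the set of combining rules usually attributed to Rubin (1987): average the m point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The Python sketch below illustrates that rule; the five estimates and variances are hypothetical numbers, not results from any survey, and the rule is shown only as an illustration of how the extra variance is measured.

```python
from statistics import mean, variance

def combine(estimates, within_variances):
    """Combine m completed-data analyses into one estimate and total variance."""
    m = len(estimates)
    q_bar = mean(estimates)            # combined point estimate
    w_bar = mean(within_variances)     # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total = w_bar + (1 + 1 / m) * b    # total variance of the combined estimate
    return q_bar, total, b

# Hypothetical results from m = 5 imputed data sets.
estimates = [0.30, 0.34, 0.32, 0.36, 0.33]
within_variances = [0.010, 0.010, 0.010, 0.010, 0.010]

q_bar, total, between = combine(estimates, within_variances)
```

The between-imputation component is the part of the total variance that would be missed if a single imputed data set were analyzed as though it were fully observed.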


8.6.7 Advantages and Disadvantages of Imputation


Imputation creates a “clean,” rectangular data set that can be analyzed by standard 
software. Analyses of different subsets of the data will produce consistent results. 
If the nonresponse is missing at random given the covariates used in the imputa- 
tion procedure, imputation substantially reduces the bias due to item nonresponse. If 
parts of the data are confidential, the collector of the data can perform the imputa- 
tion. The data collector has more information about the sample and population than 
is released to the public (for example, the collector may know the exact address 
for each sample member), and can often perform a better imputation using that 
information. 

The foremost danger of using imputation is that future data analysts will not 
distinguish between the original and the imputed values. Ideally, the imputer should 
record which observations are imputed, how many times each nonimputed record 
is used as a donor, and which donor was used for a specific response imputed to a 
recipient. The imputed values may be good guesses, but they are not real data. 

If you treat the imputed values as though they were observed in the survey, the 
estimated variance will be too small. This is partly because of the artificial increase 
in the sample size and partly because the imputed values are treated as though they 
were really obtained in the data collection. The true variance will be larger than that 
estimated from a standard software package. Rao (1996), Fay (1996), and Shao (2003) 
discuss methods for estimating the variances after imputation. 


8.7 Parametric Models for Nonresponse*


Most of the methods for dealing with nonresponse assume that the nonresponse is 
ignorable—that is, conditionally on measured covariates, nonresponse is independent 
of the variables of interest. In this situation, rather than simply dividing units among 
different subclasses and adjusting weights, one can fit a superpopulation model. From 
the model, then, one predicts the values of the y’s not in the sample. The model-fitting 
is often iterative. Often, Bayesian methods are used to fit the model. 

In a completely model-based approach, we develop a model for the complete data
and add components to the model to account for the proposed nonresponse mech- 
anism. Such an approach has many advantages over other methods: the modeling 
approach is flexible and can be used to include any knowledge about the nonresponse 
mechanism, the modeler is forced to state the assumptions about nonresponse explicitly
in the model, and some of these assumptions can be evaluated. In addition, variance
estimates that result from fitting the model account for the nonresponse, if the model 
is a good one. 


Many people believe that spotted owls in Washington, Oregon, and California are 
threatened with extinction because timber harvesting in mature coniferous forests 
reduces their available habitat. Good estimates of the size of the spotted owl population 
are needed for reasoned debate on the issue. 

In the sampling plan described by Azuma et al. (1990), a region of interest is 
divided into N sampling regions (psus), and an SRS of n psus is selected. Let $Y_i = 1$
if psu i is occupied by a pair of owls, and 0 otherwise. Assume that the $Y_i$’s are
independent and that $P(Y_i = 1) = p$, the true proportion of occupied psus. If occupancy
could be definitively determined for each psu, the proportion of psus occupied could
be estimated by the sample proportion $\bar{y}$. While a fixed number of visits can establish
that a psu is occupied, however, a determination that a psu is unoccupied may be 
wrong—some owl pairs are “nonrespondents,” and ignoring the nonresponse will 
likely result in a too-low estimate of percentage occupancy. 

Azuma et al. (1990) propose using a geometric distribution for the number of visits 
required to discover the owls in an occupied unit, thus modeling the nonresponse. The 
assumptions for the model are that: (1) the probability of determining occupancy on 
the first visit, $\eta$, is the same for all psus, (2) each visit to a psu is independent, and
(3) visits can continue until an owl is sighted. A geometric distribution is commonly 
used for number of callbacks needed in surveys of people (see Potthoff et al., 1993). 

Let $X_i$ be the number of visits required to determine whether psu i is occupied or
not. Under the geometric model,


$$P(X_i = x \mid Y_i = 1) = \eta(1 - \eta)^{x-1}, \quad \text{for } x = 1, 2, 3, \ldots.$$


The budget of the U.S. Forest Service, however, does not allow for an infinite 
number of visits. Suppose a maximum of s visits are to be made to each psu. The 
random variable $Y_i$ cannot be observed; the observable random variables are


$$V_i = \begin{cases} k & \text{if } Y_i = 1 \text{ and } X_i = k \le s \\ 0 & \text{otherwise} \end{cases}$$


and


$$U_i = \begin{cases} 1 & \text{if } Y_i = 1 \text{ and } X_i \le s \\ 0 & \text{otherwise.} \end{cases}$$


Here, $\sum_{i \in \mathcal{S}} U_i$ counts the number of psus observed to be occupied, and $\sum_{i \in \mathcal{S}} V_i$
counts the total number of visits made to occupied units. Using the geometric model,
the probability that an owl is first observed in psu i on visit $k$ ($\le s$) is


$$P(V_i = k) = \eta(1 - \eta)^{k-1} p$$


and the probability that an owl is observed on one of the s visits to psu i is


$$P(U_i = 1) = E[U_i] = [1 - (1 - \eta)^s]\,p.$$


Thus the expected value of the sample proportion of occupied units, $E[\bar{U}]$, is
$[1 - (1 - \eta)^s]p$, and is less than the proportion of interest p if $\eta < 1$. The geometric model
agrees with the intuition that owls are missed in the s visits. 

We find the maximum likelihood estimates of p and 7 under the assumption that 
all psus are independent. The likelihood function 


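$$L(p, \eta) = (\eta p)^{n\bar{u}}\,(1 - \eta)^{n(\bar{v} - \bar{u})}\,\left\{1 - p\,[1 - (1 - \eta)^s]\right\}^{n(1 - \bar{u})},$$


where $\bar{u}$ and $\bar{v}$ are the sample means of the observed $u_i$ and $v_i$,
is maximized when


$$\hat{p} = \frac{\bar{u}}{1 - (1 - \hat{\eta})^s}$$


and when $\hat{\eta}$ solves


$$\frac{\bar{v}}{\bar{u}} = \frac{1}{\hat{\eta}} - \frac{s(1 - \hat{\eta})^s}{1 - (1 - \hat{\eta})^s};$$


numerical methods are needed to calculate $\hat{\eta}$. Maximum likelihood theory also allows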
calculation of the asymptotic covariance matrix of the parameter estimates. 
An SRS of 240 habitat psus in California had the following results: 


Visit number                 1    2    3    4    5    6
Number of occupied psus     33   17   12    7    7    5


A total of 81 psus were observed to be occupied in six visits, so $\bar{u} = 81/240 = 0.3375$.
The average number of visits made to occupied units was $\bar{v}/\bar{u} = 196/81 = 2.42$. Thus
the maximum likelihood estimates are $\hat{\eta} = 0.334$ and $\hat{p} = 0.370$; using the asymptotic
covariance matrix from maximum likelihood theory, we estimate the variance of $\hat{p}$ by
0.00137. Thus, an approximate 95% confidence interval for the proportion of units
that are occupied is 0.370 ± 0.072.
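The numerical step can be sketched in Python. The sketch assumes the estimating equation sets the observed average number of visits per detected psu, 196/81, equal to the mean of a geometric variable truncated at s visits, 1/η − s(1 − η)^s / [1 − (1 − η)^s], and solves it by bisection; the function names are illustrative, and bisection is only one of many root-finding methods that could be used.

```python
# Visit number -> number of psus in which owls were first observed on that visit.
counts = {1: 33, 2: 17, 3: 12, 4: 7, 5: 7, 6: 5}
n, s = 240, 6

u_bar = sum(counts.values()) / n                       # proportion observed occupied
v_bar = sum(k * c for k, c in counts.items()) / n      # mean of the V_i

def truncated_mean(eta, s):
    """Mean of a geometric(eta) variable conditional on being at most s."""
    q = (1 - eta) ** s
    return 1 / eta - s * q / (1 - q)

def solve_eta(target, s, lo=1e-6, hi=1 - 1e-6, tol=1e-10):
    """Bisection; truncated_mean decreases as eta increases."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if truncated_mean(mid, s) > target:
            lo = mid      # mean too large, so eta must be larger
        else:
            hi = mid
    return (lo + hi) / 2

eta_hat = solve_eta(v_bar / u_bar, s)
p_hat = u_bar / (1 - (1 - eta_hat) ** s)
```

The values of `eta_hat` and `p_hat` agree with the estimates 0.334 and 0.370 reported above to three decimal places. The variance estimate 0.00137 requires the asymptotic covariance matrix and is not computed here.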

Incorporating the geometric model for number of visits gave a larger estimate of 
the proportion of units occupied. If the model does not describe the data, however, 
the estimate $\hat{p}$ will still be biased; if the model is poor, $\hat{p}$ may be a worse estimate
of the occupancy rate than $\bar{u}$. If, for example, field investigators were more likely to
find owls on later visits because they accumulate additional information on where to 
look, the geometric model would be inappropriate. 

We need to check whether the geometric model adequately describes the number 
of visits needed to determine occupancy. Unfortunately, we cannot determine whether 
the model would describe the situation for units in which owls are not detected in 
six visits, as the data are missing. We can, however, use a $\chi^2$ goodness-of-fit test to
see whether data from the six visits made are fit by the model. Under the model, we
expect $n\eta(1 - \eta)^{k-1}p$ of the psus to have owls observed on visit k, and we plug in our
estimates of p and $\eta$ to calculate expected counts:


Visit    Observed Count    Expected Count
1        33                29.66
2        17                19.74
3        12                13.14
4         7                 8.75
5, 6     12                 9.71
Total    81                80.99


Visits 5 and 6 were combined into one category so that the expected cell count would be 
greater than 5. The $\chi^2$ test statistic is 1.75, with p-value > 0.05. There is no indication
that the model is inadequate for the data we have. We cannot check its adequacy for 
the missing data, however. The geometric model assumes observations are indepen- 
dent, and that an occupied psu would eventually be determined to be occupied if 
enough visits were made. We cannot check whether that assumption of the model is 
reasonable or not: If some wily owls will never be detected in any number of visits, 
$\hat{p}$ will still be too small.
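The goodness-of-fit statistic can be recomputed with a few lines of Python. This sketch plugs in the rounded estimates 0.334 and 0.370, so the expected counts differ slightly in the second decimal place from those in the table, which were presumably computed from unrounded estimates.

```python
observed = [33, 17, 12, 7, 12]       # visits 1, 2, 3, 4, and 5-6 combined
n, s = 240, 6
eta_hat, p_hat = 0.334, 0.370        # rounded estimates reported in the text

# Expected number of psus with owls first observed on visit k = 1, ..., s.
per_visit = [n * eta_hat * (1 - eta_hat) ** (k - 1) * p_hat
             for k in range(1, s + 1)]

# Combine visits 5 and 6 so that every expected count exceeds 5.
expected = per_visit[:4] + [per_visit[4] + per_visit[5]]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The statistic `chi_sq` is about 1.75, matching the value in the text.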


Maximum likelihood methods are often used to estimate parameters in nonre- 
sponse models. Calculation of estimates required numerical methods even for the 
simple model adopted for the owls, and that was a simple random sample with a sim- 
ple geometric model for the response mechanism that allowed us to easily write down 
the likelihood function. Likelihood functions for more complex sampling designs or 
nonresponse mechanisms are much more difficult to construct (particularly if obser- 
vations in the same cluster are considered dependent), and calculating estimates often 
requires intensive computations. Little and Rubin (2002) discuss likelihood-based 
methods for missing data in general. Stasny (1991) gives an example of using models 
to account for nonresponse. 


8.8 What Is an Acceptable Response Rate?


Often an investigator will say, “I expect to get a 60% response rate in my survey. 
Is that acceptable, and will the survey give me valid results?” As we have seen in 
this chapter, the answer to that question depends on the nature of the nonresponse: 
If the nonrespondents are MCAR, then we can largely ignore the nonresponse and use 
the respondents as a representative sample of the population. If the nonrespondents 
tend to differ from the respondents, then the biases in the results from using only the 
respondents may make the entire survey worthless. 

Many references give advice on cut-offs for acceptability of response rates. 
Babbie, for example, says: “A review of the published social research literature sug- 
gests that a response rate of at least 50 percent is considered adequate for analysis and 
reporting. A response of 60 percent is good; a response rate of 70 percent is very good” 
(2007, 262). I believe that giving such absolute guidelines for acceptable response
rates is dangerous and has led many survey investigators to unfounded complacency 
about nonresponse; many examples exist of surveys with a 70% response rate whose 
results are flawed. The NCVS needs corrections for nonresponse bias even with a 
response rate of about 95%. 

Be aware that response rates can be manipulated by defining them differently. 
Researchers often do not say how the response rate was calculated or may use an 
estimate of response rate that is larger than it should be. Many surveys inflate the
response rate by eliminating units that could not be located from the denominator. Very 
different results for response rate accrue depending on which definition of response 
rate is used; all of the following have been used in surveys: 


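$$\frac{\text{number of completed interviews}}{\text{number of units in sample}},$$


$$\frac{\text{number of completed interviews}}{\text{number of units contacted}},$$


$$\frac{\text{completed interviews} + \text{ineligible units}}{\text{contacted units}},$$


$$\frac{\text{completed interviews}}{\text{contacted units} - \text{ineligible units}},$$


$$\frac{\text{completed interviews}}{\text{contacted units} - \text{ineligible units} - \text{refusals}}.$$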


Note that a “response rate” calculated using the last formula will be much higher than 
one calculated using the first formula because the denominator is smaller. 
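To see how much the choice of definition matters, the following Python sketch evaluates the five formulas for a single set of counts. The counts are hypothetical and chosen only for illustration; the dictionary keys are informal labels, not standard names for these rates.

```python
# Hypothetical disposition counts for one survey (not from any real study).
sample_size = 1000     # units in the selected sample
contacted = 800        # units contacted
completed = 500        # completed interviews
ineligible = 50        # contacted units found to be ineligible
refusals = 150         # contacted units that refused

rates = {
    "completed / sample": completed / sample_size,
    "completed / contacted": completed / contacted,
    "(completed + ineligible) / contacted": (completed + ineligible) / contacted,
    "completed / (contacted - ineligible)": completed / (contacted - ineligible),
    "completed / (contacted - ineligible - refusals)":
        completed / (contacted - ineligible - refusals),
}

for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

With these counts the same survey could report a response rate anywhere from 50% under the first definition to more than 80% under the last.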

The American Association of Public Opinion Research (2008b) gives guidelines 
for classifying units in the sample as eligible, complete or partial interviews, refusals, 
or other categories, and gives six definitions for different response rates. They rec- 
ommend that the quantities used in calculating response rate should be defined for 
every survey. The AAPOR guidelines are available online at www.aapor.org; these are 
widely accepted as the standards for reporting response rates, and using them allows 
response rates reported by different surveys to be compared. 

The U.S. Office of Management and Budget (2006) guidelines require that a 
nonresponse bias assessment be performed when the expected unit response rate is 
below 80%, or the expected item response rate is below 70%, based on the definitions 
given in the document for calculating response rate. The following recommendations 
from the U.S. Office of Management and Budget’s Federal Committee on Statistical 
Methodology, reported in Gonzalez (1994), are helpful: 


Recommendation 1. Survey staffs should compute response rates in a uniform fashion 
over time and document response rate components on each edition of a survey. 


Recommendation 2. Survey staffs for repeated surveys should monitor response rate 
components (such as refusals, not-at-homes, out-of-scopes, address not locatable, post- 
master returns, etc.) over time, in conjunction with routine documentation of cost and 
design changes. 
Recommendation 3. Response rate components should be published in survey reports;
readers should be given definitions of response rates used, including actual counts, and 
commentary on the relevance of response rates to the quality of the survey data. 


Recommendation 4. Some research on nonresponse can have real payoffs. It should 
be encouraged by survey administrators as a way to improve the effectiveness of data 
collection operations. 


Chapter Summary 


Nonresponse and undercoverage present serious problems for survey inference. The 
main concern is that failure to obtain information from some units in the selected 
sample (nonresponse), or failure to include parts of the population in the sampling 
frame (undercoverage), can result in biased estimates of population quantities. 

The survey design should include features to minimize nonresponse. Designed 
experiments can give insight into methods for increasing response rates. If possible, 
the survey frame should contain some information on everyone in the selected sample 
so that respondents and nonrespondents can be compared on those variables, and so 
that the auxiliary information can be used in adjusting for residual nonresponse. 

Weighting adjustment methods and models can be used to try to reduce nonre- 
sponse bias. In weighting class methods, the weights of respondents in a grouping 
class are increased to compensate for the nonrespondents in that grouping class. In 
poststratification, the weights of respondents in a poststratum are increased so that 
they sum to an independent count of the population in that poststratum. The nonre- 
sponse mechanism can also be modeled explicitly. 

Imputation methods create a “complete” data set by filling in values for data 
that are missing because of item nonresponse. You must be careful when analyzing 
imputed data sets to account for the imputation when estimating variances since the 
imputed values are usually derived from the data. 

All surveys should report nonresponse rates. If imputation is used, the imputed 
values should be flagged so that data analysts know which values were observed and 
which values were imputed. 


Key Terms 
Imputation: Methods used to “fill in” values for missing items so that the data set 
appears complete. 


Item nonresponse: Occurs when a unit has responses to some but not all of the items 
in the survey instrument. 


Nonresponse bias: Bias that occurs because nonrespondents differ from survey 
respondents. 


Propensity score: The probability that a unit will respond to the survey. 


Raking: A weighting adjustment method in which weights are iteratively adjusted 
to match row and column population totals. 


Respondent: A unit in the selected sample that provides data for the survey. 
Selected sample: The set of population units selected to be in the sample; this
includes the respondents and nonrespondents. 


Two-phase sampling: A method of sampling in which, after an initial probability 
sample is selected, a probability subsample is selected using inclusion probabilities 
that may depend on results from the initial sample. 


Undercoverage: Occurs when the sampling frame does not include all of the popu- 
lation of interest. 


Unit nonresponse: A failure to obtain any information from the observation unit. 


For Further Reading 


Madow et al. (1983), Groves (1989), Lessler and Kalsbeek (1992), and Groves et al. 
(2002) cover many topics about nonresponse adjustment, from both statistical and 
social science viewpoints. Little and Rubin (2002) is a general reference on methods 
for dealing with missing data (not necessarily in surveys), and is a good reference 
for model-based approaches. References for more information on weighting are Oh 
and Scheuren (1983), Holt and Elliot (1991), and Bethlehem (2002). Särndal and
Lundström (2005) give methods for adjusting for nonresponse under the unifying
umbrella of calibration. Dalenius (1981) emphasizes the importance of dealing with
nonsampling as well as sampling errors. References for imputation include Kalton
and Kasprzyk (1986), Marker et al. (2002), and Rässler et al. (2008). The journals
Survey Methodology, Journal of Official Statistics, and Public Opinion Quarterly 
publish many articles on experiments that have been done to reduce nonresponse in 
surveys of persons. 


A. Introductory Exercises 


Ryan et al. (1991) report results from the Ross Laboratories Mothers’ Survey, a 
national mail survey investigating infant feeding in the United States. Questionnaires 
asking mothers about type of milk fed to the infant during each of the first six months 
and about socioeconomic variables were mailed to a sample of mothers of six-month- 
old infants. The authors state that the number of questionnaires mailed increased 
from 1984 to 1989: “In 1984, 56,894 questionnaires were mailed and 30,694 were 
returned. In 1989, 196,000 questionnaires were mailed and 89,640 were returned.” 
Low-income families were oversampled in the survey design because they had the 
lowest response rates. Respondents were divided into subclasses defined by region, 
ethnic background, age, and education; weights were computed using information 
from the U.S. Census Bureau. 


a What are the advantages and drawbacks of oversampling the low-income families 
in this survey? What implicit model is adopted for nonresponse? 

b Weighted counts are “comparable with those published by the U.S. Bureau of
the Census and the National Center for Health Statistics” on ethnicity, maternal
age, income, education, employment, birth weight, region, and participation in
the Women, Infants, and Children supplemental food program. The weighted
counts estimated that about 53% of mothers had one child, while the government
data indicated that about 43% of mothers had one child. Does the agreement of
weighted counts with official statistics indicate that the weighting corrects the
nonresponse bias? Explain.


c Discuss the use of weighting in this survey. Can you think of any improvements?


Investigators selected an SRS of 200 high school seniors from a population of 2000 
for a survey of television-viewing habits, with an overall response rate of 75%. By 
checking school records, they were able to find the grade point average for the non- 
respondents, and classify the sample accordingly: 


GPA          Sample Size   Number of Respondents   Hours of TV (ȳ)   Hours of TV (s_y)
3.00–4.00     75            66                     32                15
2.00–2.99     72            58                     41                19
Below 2.00    53            26                     54                25
Total        200           150


a What is the estimate for the average number of hours of TV watched per week if
only respondents are analyzed? What is the standard error of the estimate?


b Perform a $\chi^2$ test for the null hypothesis that the three GPA groups have the same
response rates. What do you conclude? What do your results say about the type
of missing data: Do you think the data are MCAR? MAR? Nonignorable?


c Perform a one-way ANOVA analysis to test the null hypothesis that the three GPA
groups have the same mean level of television viewing. What do you conclude?
Does your ANOVA analysis indicate that GPA would be a good variable for
constructing weighting cells? Why, or why not?


d Use the GPA classification to adjust the weights of the respondents in the sample.
What is the weighting class estimate of the average viewing time?


e The population counts are 700 students with GPA between 3 and 4; 800 students
with GPA between 2 and 3; and 500 students with GPA less than 2. Use these
population counts to construct a poststratified estimate of the mean viewing time.


f What other methods might you use to adjust for the nonresponse?


g What other variables might be collected that could be used in nonresponse models?


The following description and assessment of nonresponse is from a study of Hamilton, 
Ontario, homeowners’ attitudes on composting toilets: 


The survey was carried out by means of a self-administered mail questionnaire. Twelve 
hundred questionnaires were sent to a randomly selected sample of house-dwellers. 
Follow-up thank you notes were sent a week later. In total, 329 questionnaires were 
returned, representing a response rate of 27%. This was deemed satisfactory since 
many mail surveyors consider a 15 to 20% response rate to be a good return (Wynia 
et al., 1993, p. 362). 


Do you agree that the response rate of 27% is satisfactory? Suppose the investigators
came to you for statistical advice on analyzing these data and designing a follow-up 
survey. What would you tell them? 


Kosmin and Lachman (1993) had a question on religious affiliation included in 56 
consecutive weekly household surveys; the subject of household surveys varied from 
week to week from cable TV use, to preference for consumer items, to political issues. 
After four callbacks, the unit nonresponse rate was 50%; an additional 2.3% refused 
to answer the religion question. The authors say: 


Nationally, the sheer number of interviews and careful research design resulted in a 
high level of precision ... Standard error estimates for our overall national sample 
show that we can be 95% confident that the figures we have obtained have an error 
margin, plus or minus, of less than 0.2%. This means, for example, that we are more 
than 95% certain that the figure for Catholics is in the range of 25.0% to 26.4% for the 
U.S. population. (p. 286): 


Critique the preceding statement. 


If you anticipated item nonresponse, do you think it would be better to insert the 
question of interest in different surveys each week, as was done here, or to use 
the same set of additional questions in each survey? Explain your answer. How 
would you design an experiment to test your conjecture? 


B. Working with Survey Data 


The issue of nonresponse in the Winter Break Closure Survey (in file winter.dat) was 
briefly mentioned in Exercise 19 of Chapter 3. What model is adopted for nonresponse 
when the formulas from stratified sampling are used to estimate the proportion of 
university employees who would answer “yes” to the question “Would you want to 
have Winter Break Closure again?” Do you think this is a reasonable model? How 
else might you model the effects of nonresponse in this survey? What additional 
information could be collected to adjust for unit nonresponse? 


The American Statistical Association (ASA) studied whether it should offer a certi- 
fication designation for its members, so that statisticians meeting the qualifications 
could be designated as “Certified Statisticians.” In 1994, the ASA surveyed its mem- 
bership about this issue, with data in file certify.dat. The survey was sent to all 18,609 
members; 5001 responses were obtained. Results from the survey were reported in 
the October 1994 issue of Amstat News. 

Assume that in 1994, the ASA membership had the following characteristics: 55% 
have Ph.D.’s and 38% have Master’s degrees; 29% work in industry, 34% work in 
academia, and 11% work in government. The cross-classification between education 
and workplace was unavailable. 


a What are the response rates for the various subclasses of ASA membership? Are 
the nonrespondents MCAR? Do you think they are MAR? 


b Use raking to adjust the weights for the six cells defined by education (Ph.D. 
or non-Ph.D.) and workplace (industry, academia, or other). Start with an initial 
weight of 18,609/5001 for each respondent. What assumptions must you make to 
use raking? 


c Can you conclude from this survey that a majority of the ASA membership
opposed certification in 1994? Why, or why not? 
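
The raking computation asked for in part (b) can be sketched as follows. This is only an illustration: the respondent counts in the six cells are hypothetical (the real counts come from certify.dat), and the function name `rake` is an assumption made for this sketch. The marginal targets are the percentages stated in the exercise, with "other" workplace taken as the remaining 37%.

```python
import numpy as np

def rake(cell_weights, row_targets, col_targets, tol=1e-8, max_iter=100):
    """Iterative proportional fitting: alternately rescale rows and columns
    of a table of summed weights until both sets of margins match."""
    w = np.asarray(cell_weights, dtype=float).copy()
    for _ in range(max_iter):
        w *= (row_targets / w.sum(axis=1))[:, None]   # match row margins
        w *= (col_targets / w.sum(axis=0))[None, :]   # match column margins
        if np.allclose(w.sum(axis=1), row_targets, rtol=tol, atol=0):
            break
    return w

N, n_resp = 18609, 5001

# HYPOTHETICAL respondent counts (rows: Ph.D., non-Ph.D.;
# columns: industry, academia, other); they sum to 5001.
counts = np.array([[800.0, 1500.0, 700.0],
                   [900.0, 400.0, 701.0]])

start = counts * N / n_resp                      # initial weight 18,609/5001 each
row_targets = N * np.array([0.55, 0.45])         # Ph.D., non-Ph.D.
col_targets = N * np.array([0.29, 0.34, 0.37])   # industry, academia, other

raked = rake(start, row_targets, col_targets)
adjusted_weight = raked / counts                 # final weight per respondent, by cell
print(np.round(adjusted_weight, 3))
```

Raking only forces the weighted margins to agree with the known totals; it does not use, and cannot recover, the unavailable cross-classification.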


The ACLS survey in Example 3.4 had nonresponse. Calculate the response rate in each 
stratum for the survey. What model was adopted for the nonresponse in Example 3.4? 
Is there evidence that the nonresponse rate varies among the strata, or that it is related 
to the percentage female membership? 


Weights are used in the Survey of Youth in Custody (discussed in Example 7.7) to 
adjust for unit nonresponse. Use a hot-deck procedure to impute values for the variable 
measuring with whom the youth lived when growing up. What variables will you use 
to group the data into classes? 


Repeat Exercise 8, using a regression imputation model. 
Repeat Exercise 8, for the variable “have used illegal drugs.” 
Repeat Exercise 9, for the variable “have used illegal drugs.” 


Gnap (1995) conducted a survey on teacher workload which was used in Exercise 15 
of Chapter 5. 


a The original survey was intended as a one-stage cluster sample. What was the 
overall response rate? 


b Would you expect nonresponse bias in this study? If so, in which direction would 
you expect the bias to be? Which teachers do you think would be less likely to 
respond to the survey? 


c Gnap also collected data on a random subsample of the nonrespondents in the
“large” stratum, in file teachnr.dat. How do the respondents and nonrespondents 
differ? 


d Is there evidence of nonresponse bias, when you compare the subsample of
nonrespondents to the respondents in the original survey?


Not all of the parents surveyed in the study discussed in Exercise 16 of Chapter 5 
returned the questionnaire. In the original sampling design, 50 questionnaires were 
mailed to parents of children in each school, for a total planned sample size of 500. 
We know that of the 9962 children who were not immunized during the campaign, 
the consent form had not been returned for 6698 of the children, the consent form had 
been returned but immunization refused for 2061 of the children, and 1203 children 
whose parents had consented were absent on immunization day. 


a Calculate the response rate for each cluster. What is the correlation of the response 
rate and the percentage of respondents in the school who returned the consent 
form? Of the response rate and the percentage of respondents in each school who 
refused consent? 


b Overall, about 67% (6698/9962) of the parents in the target population did not 
return the consent form. Using the data from the respondents, calculate a 95% CI 
for the proportion of parents in the sample who did not return the consent form. 
Calculate two additional interval estimates for this quantity: one assuming that 
the missing values are all 0’s, and one assuming that the missing values are all 
1’s. What is the relation between your estimates and the population quantity? 


c Repeat part (b), examining the percentage of parents that returned the form but
refused to have their children immunized. 


d Do you think nonresponse bias is a problem for this survey? 


Use the data in file agpop.dat for this exercise. Let $y_i$ be the value of acres92 for unit
$i$ and $x_i$ be the value of acres87 for unit $i$. Draw an SRS of size 400. Now generate
missing data from your sample by generating a standard uniform random variable
$U_i$ for each observation and deleting the observation if $16U_i > \ln(x_i)$. (Sample SAS
code for this is on the website.)


a If you ignore the missing data, do you expect the mean of $y$ to be too large or too
small?

b Compute the mean from the data set with missing values. Does a 95% CI, com- 
puted ignoring the nonresponse, contain the true mean from the population? 


c Now poststratify the sample by region, using the stratum population sizes in
Example 3.2 to adjust the sampling weight in each region. Does this appear to 
reduce the bias? 


d Try a different poststratification. This time, form 4 groups based on the value of
$x_i$ in the population, and find the number of population counties in each group.
How do the results of this poststratification compare with the poststratification by 
region? 
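
The missing-data mechanism described at the start of this exercise can be sketched in Python as an alternative to the SAS code. The population below is a hypothetical lognormal stand-in for agpop.dat, so the printed numbers say nothing about the real data; only the deletion rule $16U_i > \ln(x_i)$ is taken from the exercise.

```python
import numpy as np

rng = np.random.default_rng(2024)

# HYPOTHETICAL stand-in for agpop.dat: positive acreage values for N units.
N = 3000
acres87 = rng.lognormal(mean=12.0, sigma=1.0, size=N)
acres92 = acres87 * rng.normal(loc=0.98, scale=0.05, size=N)

# SRS of size 400, drawn without replacement.
idx = rng.choice(N, size=400, replace=False)
x, y = acres87[idx], acres92[idx]

# Delete observation i when 16 * U_i > ln(x_i); keep it otherwise.
u = rng.uniform(size=400)
respond = 16 * u <= np.log(x)

print("response rate:", respond.mean())
print("mean of y, respondents only:", y[respond].mean())
print("mean of y, full sample:", y.mean())
```

Under this rule the probability of responding is $\min\{1, \ln(x_i)/16\}$, so it depends on $x_i$ and is not constant across units.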


Repeat Exercise 14, using weighting class methods instead of poststratification. With 
weighting class methods, you adjust the weights using counts from the selected sample 
rather than the population. 


C. Working with Theory 


Let $Z_i = 1$ if unit $i$ is included in the sample and 0 otherwise, with $P(Z_i = 1) = \pi_i$.
Let $R_i = 1$ if unit $i$ responds to the survey and 0 otherwise, with $P(R_i = 1) = \phi_i$
and $\bar{\phi}_U = \sum_{i=1}^{N} \phi_i/N$. Assume $R_i$ is independent of $Z_i$ for each $i = 1, \ldots, N$.
Let $\hat{\bar{y}}_R$ estimate the population mean $\bar{y}_U = \sum_{i=1}^{N} y_i/N$ using only the respondents:

$$\hat{\bar{y}}_R = \frac{\sum_{i=1}^{N} Z_i R_i w_i y_i}{\sum_{i=1}^{N} Z_i R_i w_i},$$

where $w_i = 1/\pi_i$. Show that the bias of $\hat{\bar{y}}_R$ is approximately

$$\frac{1}{N \bar{\phi}_U} \sum_{i=1}^{N} (\phi_i - \bar{\phi}_U)(y_i - \bar{y}_U) \approx \frac{\mathrm{Cov}(\phi, y)}{\bar{\phi}_U},$$

where $\mathrm{Cov}(\phi, y) = \sum_{i=1}^{N} (\phi_i - \bar{\phi}_U)(y_i - \bar{y}_U)/(N - 1)$. As a consequence, the
nonresponse bias is approximately zero if either (1) $\phi_i = \bar{\phi}_U$ for all $i$, that is, the response
propensity is the same for all units, or (2) the propensity to respond is uncorrelated
with the response $y_i$ (Tremblay, 1986).


Let $Z_i$ and $R_i$ be as defined in Exercise 16. Divide the sample into $C$ weighting classes
and define $x_{ci} = 1$ if unit $i$ is in class $c$ and 0 otherwise. Let
$\bar{\phi}_c = \sum_{i=1}^{N} \phi_i x_{ci} / \sum_{i=1}^{N} x_{ci}$,

$$\hat{\phi}_c = \frac{\sum_{i=1}^{N} Z_i R_i w_i x_{ci}}{\sum_{i=1}^{N} Z_i w_i x_{ci}},$$

and

$$\hat{t}_{\hat{\phi}} = \sum_{c=1}^{C} \sum_{i=1}^{N} \frac{Z_i R_i w_i x_{ci} y_i}{\hat{\phi}_c}.$$

Show that if the weighting classes are sufficiently large,

$$\mathrm{Bias}(\hat{t}_{\hat{\phi}}) \approx \sum_{c=1}^{C} \sum_{i=1}^{N} x_{ci} \phi_i (y_i - \bar{y}_{cU})/\bar{\phi}_c,$$

where $\bar{y}_{cU} = \sum_{i=1}^{N} y_i x_{ci} / \sum_{i=1}^{N} x_{ci}$. Thus, the weighting class adjustments for
nonresponse in Section 8.5 produce an approximately unbiased estimator if (a) $\phi_i = \bar{\phi}_c$ for
all units in class $c$, (b) $y_i = \bar{y}_{cU}$ for all units in class $c$, or (c) within each class $c$, the
propensity to respond is uncorrelated with $y_i$.


Let $Z_i$ and $R_i$ be as defined in Exercise 16. Divide the sample into $H$ poststrata. Let
$N_h$ be the number of population units in poststratum $h$, obtained from an independent
source such as a population register or census. Show that if the poststrata are
sufficiently large,

$$\mathrm{Bias}(\hat{\bar{y}}_{post}) \approx \sum_{h=1}^{H} \frac{N_h}{N} \mathrm{Cov}_h(\phi, y)/\bar{\phi}_h,$$

where $\bar{\phi}_h = \sum_{i=1}^{N} x_{hi} \phi_i/N_h$ and $\mathrm{Cov}_h(\phi, y)$ is the population covariance of the $y_i$'s
and $\phi_i$'s for population units in poststratum $h$.


Effect of weighting class adjustment on variances. Suppose that an SRS of size $n$ is
taken. Let $Z_i = 1$ if unit $i$ is included in the sample and 0 otherwise, with $P(Z_i = 1) = n/N$.
Two weighting classes are used to adjust for nonresponse; define $x_i = 1$ if unit $i$
is in class 1 and 0 if unit $i$ is in class 2. Let $R_i = 1$ if unit $i$ responds to the survey and
0 otherwise. Assume that the $R_i$'s are independent Bernoulli random variables with
$P(R_i = 1) = x_i \phi_1 + (1 - x_i)\phi_2$, and that $R_i$ is independent of $Z_1, \ldots, Z_N$. The sample
sizes in the two classes are $n_1 = \sum_{i=1}^{N} Z_i x_i$ and $n_2 = \sum_{i=1}^{N} Z_i (1 - x_i)$; note that $n_1$ and
$n_2$ are random variables. Similarly, the numbers of respondents in the two classes are
$n_{1R} = \sum_{i=1}^{N} Z_i R_i x_i$ and $n_{2R} = \sum_{i=1}^{N} Z_i R_i (1 - x_i)$. Assume the number of respondents
in each group is sufficiently large so that $E[n_c/n_{cR}] \approx 1/\phi_c$ for $c = 1, 2$. With these
assumptions, the weighting class adjusted estimator of the mean,

$$\hat{\bar{y}}_{wc} = \frac{n_1}{n} \frac{1}{n_{1R}} \sum_{i=1}^{N} Z_i R_i x_i y_i + \frac{n_2}{n} \frac{1}{n_{2R}} \sum_{i=1}^{N} Z_i R_i (1 - x_i) y_i,$$

is approximately unbiased for the population mean $\bar{y}_U$ (see Exercise 17). Find the
approximate variance of $\hat{\bar{y}}_{wc}$. Hint: Use Property A.4 of Conditional Expectation in
Section A.4.


The Hartley (1946) and Politz—Simmons (1949) method. Suppose that all calls are 
made during Monday through Friday evenings. Each respondent is asked whether 
he or she was at home at the time of the interview, on each of the four preceding 
weeknights. Respondent i replies that she was home k; of the 4 nights. It is then 
assumed that the probability of response is proportional to the number of nights 
at home during interviewing hours, so the probability of response is estimated by 
$\hat{\phi}_i = (k_i + 1)/5$. Let

$$\hat{\bar{y}}_{HPS} = \frac{\sum_{i \in S} w_i y_i/\hat{\phi}_i}{\sum_{i \in S} w_i/\hat{\phi}_i}.$$


a Under what circumstances would you expect the method to reduce bias due to non- 
response? What assumptions must be made for the estimator to be approximately 
unbiased? 


b What are some potential drawbacks of the method for use in practice? How does 
it adjust for persons who were not at home during any of the five nights, or who 
refused to participate in the survey? 
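
As a small numerical illustration of the estimator defined in this exercise, the sketch below computes the weighted mean with each weight divided by $(k_i + 1)/5$. The four respondents and the function name `hartley_politz_simmons` are hypothetical, chosen only to show how respondents who are rarely at home are weighted up.

```python
import numpy as np

def hartley_politz_simmons(y, w, k):
    """Weighted mean in which each respondent's weight is divided by the
    estimated response probability (k_i + 1)/5."""
    y, w, k = (np.asarray(a, dtype=float) for a in (y, w, k))
    phi_hat = (k + 1) / 5
    adjusted = w / phi_hat
    return np.sum(adjusted * y) / np.sum(adjusted)

# HYPOTHETICAL respondents: y is the response, k the number of the 4
# preceding weeknights the respondent reported being at home.
y = [10.0, 20.0, 30.0, 40.0]
w = [1.0, 1.0, 1.0, 1.0]
k = [4, 4, 1, 0]

est = hartley_politz_simmons(y, w, k)
print(est, np.average(y, weights=w))   # adjusted versus unadjusted mean
```

Here the respondent with $k_i = 0$ receives five times the weight of one with $k_i = 4$, which is why the adjusted mean moves toward the values reported by people who are seldom home.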


D. Projects and Activities 


Find a recent poll on a website. How do they describe the sources of error in the 
survey? Do they give the nonresponse rate, or reference a document that details the 
treatment of nonresponse? How do they adjust for nonresponse in their estimates? 


Find an example of a survey in a popular newspaper or magazine. Is the nonresponse 
rate given? If so, how was it calculated? How do you think the nonresponse might 
have affected the conclusions of the survey? Give suggestions for how the journalist 
could discuss nonresponse problems in the article. 


Find an example of a survey in a scholarly journal. How did the authors calculate 
the nonresponse rate? How did the survey deal with nonresponse? How do you think 
the nonresponse might have affected the conclusions of the study? Do you think the 
authors adequately account for potential nonresponse biases? What suggestions do 
you have for future studies? 


The U.S. National Science Foundation Division of Science Resources Studies
published results from the 2003 Survey of Doctorate Recipients in “Characteristics of
Doctoral Scientists and Engineers in the United States: 2003.”³ How does this
survey deal with nonresponse, discussed on page 153 of the report? Do you think that
nonresponse bias is a problem for this survey?


³ NSF Publication 06-320. Available at www.nsf.gov/statistics/nsf06320/pdf/nsf06320.pdf.


How did the survey you critiqued in Exercise 26 of Chapter 7 deal with
nonresponse? In your opinion, did the investigators adequately address the problems of
nonresponse? What suggestions do you have for improvement?


Answer the questions in Exercise 25 for the survey you examined in Exercise 30 of 
Chapter 7. 


Activity for course project. Return to the data you chose in Exercise 31 of Chap- 
ter 7. What kinds of nonresponse occur in your data set? How does the survey define 
nonresponse rate, and what are the nonresponse rates for the survey? What methods 
are used to try to adjust for the nonresponse? 


Variance Estimation in Complex 
Surveys 




Population means and totals are easily estimated using weights. Estimating variances 
is more intricate: In Chapter 7 we noted that in a complex survey with several levels of 
stratification and clustering, variances for estimated means and totals are calculated 
at each level and then combined as the survey design is ascended. Poststratification 
and nonresponse adjustments also affect the variance. 

In previous chapters, we have presented and derived variance formulas for a 
variety of sampling plans. Some of the variance formulas, such as those for simple 
random samples (SRSs), are relatively simple. Other formulas, such as $\hat{V}(\hat{t})$ from a
two-stage cluster sample without replacement, are more complicated. All work for 
estimating variances of estimated totals. But we often want to estimate other quantities 
from survey data for which we have presented no variance formula. For example, in 
Chapter 4 we derived an approximate variance for a ratio of two means when an SRS 
is taken. What if you want to estimate a ratio, but the survey is not an SRS? How 
would you estimate the variance? 

This chapter describes several methods for estimating variances of estimated totals 
and other statistics from complex surveys. Section 9.1 describes the commonly used 
linearization method for calculating variances of estimators. Sections 9.2 and 9.3 
present random group and resampling methods for calculating variances of linear 
and nonlinear statistics. Section 9.4 describes the calculation of generalized variance 
functions, and Section 9.5 describes constructing confidence intervals (CIs). 


9.1 Linearization (Taylor Series) Methods


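Most of the variance formulas in Chapters 2 through 6 were for estimators of means
and totals. Those formulas can be used to find variances for any linear combination of
estimated means and totals. Let $y_{ij}$ be the response of unit $i$ to item $j$. Suppose
$\hat{t}_1, \ldots, \hat{t}_k$ are unbiased estimators of the $k$ population totals $t_1, \ldots, t_k$, with
$\hat{t}_j = \sum_{i \in S} w_i y_{ij}$. Then, for any constants $a_1, \ldots, a_k$, we can define a new variable

$$q_i = \sum_{j=1}^{k} a_j y_{ij}$$

so that

$$\hat{t}_q = \sum_{i \in S} w_i q_i = \sum_{j=1}^{k} a_j \hat{t}_j$$

and

$$V(\hat{t}_q) = V\Bigl(\sum_{j=1}^{k} a_j \hat{t}_j\Bigr) = \sum_{j=1}^{k} a_j^2 V(\hat{t}_j) + 2 \sum_{j=1}^{k-1} \sum_{l=j+1}^{k} a_j a_l \,\mathrm{Cov}(\hat{t}_j, \hat{t}_l). \tag{9.1}$$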


Thus, if $t_1$ is the total number of dollars robbery victims reported stolen, $t_2$ is the
number of days of work they missed because of the crime, and $t_3$ is their total medical
expenses, one measure of financial consequences of robbery (assuming \$150 per day
of work lost) might be $\hat{t}_1 + 150\hat{t}_2 + \hat{t}_3$. By (9.1), the variance is

$$V(\hat{t}_1 + 150\hat{t}_2 + \hat{t}_3) = V(\hat{t}_q)
= V(\hat{t}_1) + 150^2 V(\hat{t}_2) + V(\hat{t}_3)
+ 300\,\mathrm{Cov}(\hat{t}_1, \hat{t}_2) + 2\,\mathrm{Cov}(\hat{t}_1, \hat{t}_3) + 300\,\mathrm{Cov}(\hat{t}_2, \hat{t}_3),$$

where $q_i = y_{i1} + 150 y_{i2} + y_{i3}$ is the financial loss from robbery for person $i$.
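
Equation (9.1) is the quadratic form $a^T C a$, where $C$ is the covariance matrix of the estimated totals. The following sketch checks this for the robbery example with constants (1, 150, 1); the covariance matrix is hypothetical and serves only to exercise the arithmetic.

```python
import numpy as np

def var_linear_combination(a, cov):
    """Variance of sum_j a_j * t_hat_j, computed as a' C a; expanding the
    quadratic form term by term gives equation (9.1)."""
    a = np.asarray(a, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return float(a @ cov @ a)

# HYPOTHETICAL covariance matrix of (t_hat_1, t_hat_2, t_hat_3).
cov = np.array([[4.0e6, 1.0e3, 2.0e5],
                [1.0e3, 9.0e0, 3.0e2],
                [2.0e5, 3.0e2, 1.0e6]])
a = [1.0, 150.0, 1.0]

v = var_linear_combination(a, cov)

# The same quantity written out as in the robbery example.
expanded = (cov[0, 0] + 150**2 * cov[1, 1] + cov[2, 2]
            + 300 * cov[0, 1] + 2 * cov[0, 2] + 300 * cov[1, 2])
print(v, expanded)
```

The coefficients 300, 2, and 300 on the covariance terms are $2 a_j a_l$ for the pairs (1, 2), (1, 3), and (2, 3).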

Suppose, though, that we are interested in the proportion of total loss accounted for
by the stolen property, $t_1/t_q$. This is not a linear statistic, as $t_1/t_q$ cannot be expressed
in the form $a_1 t_1 + a_2 t_q$ for constants $a_1$ and $a_2$. But Taylor’s theorem from calculus
allows us to linearize a smooth nonlinear function $h(t_1, t_2, \ldots, t_k)$ of the population
totals; Taylor’s theorem gives the constants $a_0, a_1, \ldots, a_k$ so that

$$h(\hat{t}_1, \ldots, \hat{t}_k) \approx a_0 + \sum_{j=1}^{k} a_j \hat{t}_j.$$

Then $V[h(\hat{t}_1, \ldots, \hat{t}_k)]$ may be approximated by $V\bigl(\sum_{j=1}^{k} a_j \hat{t}_j\bigr)$, which we know how to
calculate using (9.1).

Taylor series approximations have long been used in statistics to calculate
approximate variances. Woodruff (1971) illustrates their use in complex surveys. Binder
(1983) gives a more rigorous treatment of Taylor series methods for complex
surveys and tells how to use linearization when the parameter of interest $\theta$ solves
$h(\theta, t_1, \ldots, t_k) = 0$, but $\theta$ is an implicit function of $t_1, \ldots, t_k$.


FIGURE 9.1
The function $h(x) = x(1 - x)$, along with the tangent to the function at point $p$. If $\hat{p}$ is close to
$p$, then $h(\hat{p})$ will be close to the tangent line. The slope of the tangent line is $h'(p) = 1 - 2p$.


EXAMPLE 9.1 The quantity $\theta = p(1 - p)$, where $p$ is a population proportion, may be estimated by
$\hat{\theta} = \hat{p}(1 - \hat{p})$. Assume that $\hat{p}$ is an unbiased estimator of $p$ and that $V(\hat{p})$ is known.
Let $h(x) = x(1 - x)$, so $\theta = h(p)$ and $\hat{\theta} = h(\hat{p})$. Now $h$ is a nonlinear function of
$x$, but the function can be approximated at any nearby point $a$ by the tangent line to
the function; the slope of the tangent line is given by the derivative, as illustrated in
Figure 9.1.
The first-order version of Taylor’s theorem states that if the second derivative of
$h$ is continuous, then

$$h(x) = h(a) + h'(a)(x - a) + \int_a^x (x - t) h''(t)\,dt;$$

under conditions commonly satisfied in statistics, the last term is small relative to the
first two and we use the approximation

$$h(\hat{p}) \approx h(p) + h'(p)(\hat{p} - p) = p(1 - p) + (1 - 2p)(\hat{p} - p).$$

Then,

$$V[h(\hat{p})] \approx (1 - 2p)^2 V(\hat{p} - p),$$

and $V(\hat{p})$ is known, so the approximate variance of $h(\hat{p})$ can be estimated by

$$\hat{V}[h(\hat{p})] = (1 - 2\hat{p})^2 V(\hat{p}).$$
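
The quality of the approximation in Example 9.1 can be checked by simulation. The sketch below assumes, purely for illustration, that $\hat{p}$ is a binomial proportion with $p = 0.2$ and $n = 500$ (an SRS with replacement); these values and the function name are not from the text.

```python
import numpy as np

def linearized_var_p1mp(p_hat, var_p_hat):
    """Linearization estimate of V[p_hat (1 - p_hat)]: (1 - 2 p_hat)^2 V(p_hat)."""
    return (1 - 2 * p_hat) ** 2 * var_p_hat

# ASSUMED setting for the check: binomial sampling with p = 0.2, n = 500.
rng = np.random.default_rng(1)
p, n = 0.2, 500
p_hats = rng.binomial(n, p, size=200_000) / n

simulated = np.var(p_hats * (1 - p_hats))           # Monte Carlo variance of h(p_hat)
approx = linearized_var_p1mp(p, p * (1 - p) / n)    # (1 - 2p)^2 V(p_hat)
print(simulated, approx)
```

Note that the linearized variance is zero at $p = 0.5$, where the tangent line is flat; there the neglected second-order term dominates and the approximation should not be trusted.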


The following are the basic steps for constructing a linearization estimator of the
variance of a nonlinear function of means or totals:


1 Express the quantity of interest as a function of means or totals of variables
measured or computed in the sample. In general, $\theta = h(t_1, t_2, \ldots, t_k)$ or
$\theta = h(\bar{y}_{1U}, \ldots, \bar{y}_{kU})$. In Example 9.1, $\theta = h(\bar{y}_U) = h(p) = p(1 - p)$ and $\hat{\theta} = h(\hat{p})$.

2 Find the partial derivative of $h$ with respect to each argument. The partial
derivatives, evaluated at the population quantities, form the linearizing constants $a_j$.

3 Apply Taylor’s theorem to linearize the estimate:

$$h(\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_k) \approx h(t_1, \ldots, t_k) + \sum_{j=1}^{k} a_j (\hat{t}_j - t_j),$$

where

$$a_j = \left.\frac{\partial h(c_1, c_2, \ldots, c_k)}{\partial c_j}\right|_{t_1, t_2, \ldots, t_k}.$$

4 Define the new variable $q$ by

$$q_i = \sum_{j=1}^{k} a_j y_{ij}.$$

Now find the estimated variance of $\hat{t}_q = \sum_{i \in S} w_i q_i$, substituting estimators for
unknown population quantities. This will generally approximate the variance of
$\hat{\theta} = h(\hat{t}_1, \ldots, \hat{t}_k)$.


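EXAMPLE 9.2 We used linearization methods to approximate the variance of the ratio and regression
estimators in Chapter 4. In Chapter 4, we used an SRS, estimator $\hat{B} = \bar{y}/\bar{x} = \hat{t}_y/\hat{t}_x$, and
the approximation

$$\hat{B} - B \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U} = \frac{1}{n \bar{x}_U} \sum_{i \in S} (y_i - B x_i).$$

In (4.10), we estimated the variance by

$$\hat{V}(\hat{B}) = \left(1 - \frac{n}{N}\right) \frac{s_e^2}{n \bar{x}^2},$$

where $s_e^2$ is the sample variance of the residuals $e_i = y_i - \hat{B} x_i$. Essentially, we used
Taylor’s theorem to obtain this estimator. The steps below give the same result.


1 Express $B$ as a function of the population totals. Let $h(c, d) = d/c$, so

$$B = h(t_x, t_y) = \frac{t_y}{t_x} \quad \text{and} \quad \hat{B} = h(\hat{t}_x, \hat{t}_y) = \frac{\hat{t}_y}{\hat{t}_x}.$$

Assume that the estimators $\hat{t}_x$ and $\hat{t}_y$ are unbiased.

2 The partial derivatives are

$$\frac{\partial h(c, d)}{\partial c} = -\frac{d}{c^2} \quad \text{and} \quad \frac{\partial h(c, d)}{\partial d} = \frac{1}{c};$$

evaluated at $c = t_x$ and $d = t_y$, these are $-t_y/t_x^2$ and $1/t_x$.

3 By Taylor’s theorem,

$$\hat{B} = h(\hat{t}_x, \hat{t}_y) \approx h(t_x, t_y) + \left.\frac{\partial h(c, d)}{\partial c}\right|_{t_x, t_y} (\hat{t}_x - t_x) + \left.\frac{\partial h(c, d)}{\partial d}\right|_{t_x, t_y} (\hat{t}_y - t_y).$$

Using the partial derivatives from Step 2,

$$\hat{B} - B \approx -\frac{t_y}{t_x^2} (\hat{t}_x - t_x) + \frac{1}{t_x} (\hat{t}_y - t_y).$$

4 The approximate mean squared error (MSE) of $\hat{B}$ is

$$E[(\hat{B} - B)^2] \approx E\left[\left\{-\frac{t_y}{t_x^2} (\hat{t}_x - t_x) + \frac{1}{t_x} (\hat{t}_y - t_y)\right\}^2\right] \tag{9.2}$$

$$= \frac{1}{t_x^2} \left\{B^2 V(\hat{t}_x) + V(\hat{t}_y) - 2B\,\mathrm{Cov}(\hat{t}_x, \hat{t}_y)\right\}. \tag{9.3}$$

Substitute estimators of the unknown quantities into (9.2) to define

$$q_i = \frac{1}{\hat{t}_x} [y_i - \hat{B} x_i] = \frac{e_i}{\hat{t}_x},$$

and find $\hat{V}(\hat{B}) = \hat{V}(\hat{t}_q) = \hat{V}(\hat{t}_e)/\hat{t}_x^2$ using the survey design. For an SRS, this results
in the variance estimator in (4.10). ■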
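
The four steps of Example 9.2 can be traced numerically for an SRS. In the sketch below the data, the population size, and the function names are hypothetical; the point is only that the variance of $\hat{t}_q$ computed with the usual SRS formula agrees with the direct form in (4.10).

```python
import numpy as np

def ratio_linearization_var(y, x, N):
    """Linearization variance estimate for B_hat = t_hat_y / t_hat_x
    under an SRS of n units from a population of N (illustrative only)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    w = N / n                              # SRS sampling weight
    t_hat_x = w * x.sum()
    b_hat = y.sum() / x.sum()
    q = (y - b_hat * x) / t_hat_x          # q_i = e_i / t_hat_x
    # SRS variance estimate of the total t_hat_q = sum of w * q_i
    v_hat = N**2 * (1 - n / N) * q.var(ddof=1) / n
    return b_hat, v_hat

def ratio_var_4_10(y, x, N):
    """Direct form (1 - n/N) * s_e^2 / (n * xbar^2), for comparison."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    e = y - (y.sum() / x.sum()) * x
    return (1 - n / N) * e.var(ddof=1) / (n * x.mean() ** 2)

# HYPOTHETICAL sample of n = 6 units from a population of N = 100.
x = [10, 12, 9, 15, 11, 14]
y = [22, 25, 20, 31, 24, 29]

b_hat, v_lin = ratio_linearization_var(y, x, N=100)
v_direct = ratio_var_4_10(y, x, N=100)
print(b_hat, v_lin, v_direct)
```

For a design other than an SRS, only the line that computes the variance of the total of the $q_i$ would change; the construction of $q_i$ stays the same.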


The method in Example 9.2 requires substituting estimators for quantities such as
$B$. Note that alternative variance estimators can be derived from (9.2). In particular,
if $t_x$ is known, it can be used in place of an estimator $\hat{t}_x$ in the denominator of $q_i$,
giving $\hat{V}_2(\hat{B}) = \hat{V}(\hat{t}_e)/t_x^2$. The estimators $\hat{V}(\hat{B})$ and $\hat{V}_2(\hat{B})$ are asymptotically equivalent,
since we expect $t_x/\hat{t}_x \approx 1$ for large sample sizes. For small samples, $\hat{V}(\hat{B})$ works
slightly better than $\hat{V}_2(\hat{B})$ in many situations (see Exercise 19 of Chapter 4). An
alternative procedure for deriving linearization variance estimators that results in a
unique estimator is discussed in Exercise 23.


Advantages: If the partial derivatives are known, linearization almost always gives 
a variance estimate for a statistic and can be applied in general sampling designs. 
Linearization methods have been used for a long time in statistics, and the theory is 
well developed. Software exists for calculating linearization variance estimates for 
many nonlinear functions of interest such as ratios and regression coefficients; some 
software will be discussed in Section 9.6. 


Disadvantages: Calculations can be messy, and the method is difficult to apply for 
complex functions involving weights. You must either find analytical expressions for 
the partial derivatives of h or calculate the partial derivatives numerically. A separate 
variance formula is needed for each nonlinear statistic that is estimated, and that can 
require much special programming; a different method is needed for each statistic. 
In addition, not all statistics can be expressed as a smooth function of the population 
totals—the median and other quantiles, for example, do not fit into this framework. 
The accuracy of the linearization approximation depends on the sample size—the 
variance estimator is often biased downwards if the sample is not large enough. 


9.2 Random Group Methods

9.2.1 Replicating the Survey Design


Suppose the basic survey design is replicated independently $R$ times. Independently
here means that each of the $R$ sets of random variables used to select the sample
is independent of the other sets: after each sample is drawn, the sampled units are
replaced in the population so they are available for later samples. Then the $R$ replicate
samples produce $R$ independent estimates of the quantity of interest; the variability
among those estimates can be used to estimate the variance of $\hat{\theta}$. Mahalanobis (1939,
1946) describes early uses of the method, which he calls “replicated networks of
sample units” and “interpenetrating sampling.”
Let

$\theta$ = parameter of interest
$\hat{\theta}_r$ = estimate of $\theta$ calculated from $r$th replicate
$\tilde{\theta} = \frac{1}{R} \sum_{r=1}^{R} \hat{\theta}_r$.

If $\hat{\theta}_r$ is an unbiased estimator of $\theta$, so is $\tilde{\theta}$, and

$$\hat{V}_1(\tilde{\theta}) = \frac{1}{R} \frac{1}{R - 1} \sum_{r=1}^{R} (\hat{\theta}_r - \tilde{\theta})^2 \tag{9.4}$$

is an unbiased estimator of $V(\tilde{\theta})$. Note that $\hat{V}_1(\tilde{\theta})$ is the sample variance of the $R$
independent estimators of $\theta$ divided by $R$, the usual estimator of the variance of a
sample mean.


EXAMPLE 9.3 The 1991 Information Please Almanac listed enrollment, tuition, and room-and-board
costs for every four-year college in the United States. Suppose we want to estimate the
ratio of nonresident tuition to resident tuition for public colleges and universities in the
United States. In a typical implementation of the random group method, independent
samples would be chosen using the same design, and $\hat{\theta}$ found for each sample. Let’s
take four SRSs of size 10 each. The four SRSs are without replacement, but the same
college can appear in more than one of the four SRSs. The data are in file college91.dat,
with summary statistics for the four SRSs in Table 9.1.
For this example,

$$\hat{\theta}_i = \frac{\text{average of nonresident tuitions for sample } i}{\text{average of resident tuitions for sample } i},$$

so $\hat{\theta}_1 = 2.3288$, $\hat{\theta}_2 = 2.5802$, $\hat{\theta}_3 = 2.4591$, and $\hat{\theta}_4 = 3.1110$. The sample average of the
four independent estimates of $\theta$ is $\tilde{\theta} = 2.6198$. The sample standard deviation of the
four estimates is 0.343, so the standard error of $\tilde{\theta}$ is $0.343/\sqrt{4} = 0.172$. The variance
is estimated from four independent observations, so a 95% CI for the ratio is


$2.6198 \pm 3.18(0.172) = [2.07, 3.17]$,
where 3.18 is the $t$ critical value with 3 degrees of freedom (df). Note that the small
number of replicates causes the CI to be wider than it would be if more replicate
samples were taken, because the estimate of the variance with 3 df is not very
stable.


TABLE 9.1
Summary Statistics for Four SRSs of Colleges, Used in Example 9.3

Sample     Average       Average             Average
Number     Enrollment    Resident Tuition    Nonresident Tuition
1          6934.2        1559.0              3630.6
2          6968.6        1505.2              3883.7
3          4790.2        1527.5              3756.3
4          8613.0        1527.1              4750.8
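
The arithmetic of Example 9.3 can be reproduced directly from the averages in Table 9.1. This is a sketch of the calculation only; the critical value 3.18 is the one quoted in the example rather than computed.

```python
import numpy as np

# Averages from Table 9.1 (four independent SRSs of size 10).
resident = np.array([1559.0, 1505.2, 1527.5, 1527.1])
nonresident = np.array([3630.6, 3883.7, 3756.3, 4750.8])

theta_r = nonresident / resident        # ratio estimate from each replicate
R = len(theta_r)
theta_tilde = theta_r.mean()            # average of the replicate estimates
v1 = theta_r.var(ddof=1) / R            # equation (9.4)
se = np.sqrt(v1)

t_crit = 3.18                           # t critical value with 3 df, from the example
ci = (theta_tilde - t_crit * se, theta_tilde + t_crit * se)
print(np.round(theta_r, 4), round(theta_tilde, 4), round(se, 3), np.round(ci, 2))
```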


9.2.2 Dividing the Sample into Random Groups


In practice, subsamples are not usually drawn independently, but the complete sample
is selected according to the survey design. The complete sample is then divided into
$R$ groups, so that each group forms a miniature version of the survey, mirroring the
sample design. The groups are then treated as though they are independent replicates
of the basic survey design. This method was first described by Hansen et al. (1953,
p. 440).

If the sample is an SRS of size $n$, the groups are formed by randomly apportioning
the $n$ observations into $R$ groups, each of size $n/R$. These pseudo-random groups are
not quite independent replicates because an observation unit can only appear in one
of the groups; if the population size is large relative to the sample size, however, the
groups can be treated as though they are independent replicates. In a cluster sample,
the psus are randomly divided among the $R$ groups. The psu takes all its observation
units with it to the random group, so each random group is still a cluster sample. In
a stratified multistage sample, a random group contains a sample of psus from each
stratum. Note that if $k$ psus are sampled in the smallest stratum, at most $k$ random
groups can be formed.

If $\theta$ is a nonlinear quantity, $\tilde{\theta}$ will not, in general, be the same as $\hat{\theta}$, the
estimator calculated directly from the complete sample. For example, in ratio estimation,
$\tilde{\theta} = (1/R) \sum_{r=1}^{R} \bar{y}_r/\bar{x}_r$, while $\hat{\theta} = \bar{y}/\bar{x}$. Usually, $\hat{\theta}$ is a more stable estimator than $\tilde{\theta}$.
Sometimes $\hat{V}_1(\tilde{\theta})$ is used to estimate $V(\hat{\theta})$, although it is an overestimate. Another
estimator of the variance is slightly larger, but is often used:

$$\hat{V}_2(\hat{\theta}) = \frac{1}{R} \frac{1}{R - 1} \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2. \tag{9.5}$$


EXAMPLE 9.4 The 1987 Survey of Youths in Custody, discussed in Example 7.7, was divided into
seven random groups. The survey design had 16 strata. Strata 6-16 each consisted of
one facility (= psu), and these facilities were sampled with probability one. In strata
1-5, facilities were selected with probability proportional to number of residents in
the 1985 Children in Custody census.

It was desired that each random group be a miniature of the sampling design. For 
each self-representing facility in strata 6-16, random group numbers were assigned as 
follows: The first resident selected from the facility was assigned a number between 
1 and 7. Let’s say the first resident was assigned number 6. Then the second resident 
in that facility would be assigned number 7, the third resident 1, the fourth resident 
2, and so on. In strata 1-5, all residents in a facility (psu) were assigned to the same 
random group. Thus for the seven facilities sampled in stratum 2, all residents in 
facility 33 were assigned random group number 1, all residents in facility 9 were
assigned random group number 2, and so on. Seven random groups were formed 
because strata 2 through 5 each have seven psus. 

After all random group assignments were made, each random group had the same 
basic design as the original sample. Random group 1, for example, forms a stratified 
sample in which a (roughly) random sample of residents is taken from the self- 
representing facilities in strata 6-16, and an unequal-probability sample of facilities 
is taken from each of strata 1-5. 

To use the random group method to estimate a variance, $\hat{\theta}$ is calculated for each
random group. The following table shows estimates of mean age of residents for each
random group (SAS code for these calculations is given on the website); each estimate
was calculated using

$$\hat{\theta}_r = \frac{\sum w_i y_i}{\sum w_i},$$

where $w_i$ is the final weight and $y_i$ is the age for resident $i$, and the summations are over
observations in random group $r$.


Random Group Number    Estimate of Mean Age, $\hat{\theta}_r$
1                      16.55
2                      16.66
3                      16.83
4                      16.06
5                      16.32
6                      17.03
7                      17.27


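The seven estimates of $\theta$ are treated as independent observations, so

$$\tilde{\theta} = \frac{1}{7} \sum_{r=1}^{7} \hat{\theta}_r = 16.67$$

and

$$\hat{V}_1(\tilde{\theta}) = \frac{1}{7} \left[\frac{1}{6} \sum_{r=1}^{7} (\hat{\theta}_r - \tilde{\theta})^2\right] = 0.024.$$

Using the entire data set, we calculate $\hat{\theta} = 16.64$ with

$$\hat{V}_2(\hat{\theta}) = \frac{1}{7} \left[\frac{1}{6} \sum_{r=1}^{7} (\hat{\theta}_r - \hat{\theta})^2\right] = \frac{0.1716}{7} = 0.025.$$

We can use either $\tilde{\theta}$ or $\hat{\theta}$ to calculate CIs; using $\hat{\theta}$, a 95% CI for mean age is


$16.64 \pm 2.45\sqrt{0.025} = [16.3, 17.0]$


(2.45 is the $t$ critical value with 6 df). ■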
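
The two variance estimates in Example 9.4 differ only in the point about which the replicate estimates are centered. The sketch below recomputes them from the seven rounded group estimates in the table, so the last digits can differ slightly from values based on unrounded data; the full-sample estimate 16.64 and the critical value 2.45 are taken from the example.

```python
import numpy as np

# Random group estimates of mean age, as tabulated in the example.
theta_r = np.array([16.55, 16.66, 16.83, 16.06, 16.32, 17.03, 17.27])
R = len(theta_r)

theta_hat = 16.64                        # full-sample estimate quoted in the example
theta_tilde = theta_r.mean()             # average of the seven group estimates

v1 = np.sum((theta_r - theta_tilde) ** 2) / (R * (R - 1))   # equation (9.4)
v2 = np.sum((theta_r - theta_hat) ** 2) / (R * (R - 1))     # equation (9.5)

t_crit = 2.45                            # t critical value with 6 df, from the example
half_width = t_crit * np.sqrt(v2)
ci = (theta_hat - half_width, theta_hat + half_width)
print(round(theta_tilde, 2), round(v1, 3), round(v2, 3), np.round(ci, 1))
```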


Advantages: No special software is necessary to estimate the variance, and it is very 
easy to calculate the variance estimate. The method is well-suited to multiparameter 
or nonparametric problems. It can be used to estimate variances for percentiles and 
nonsmooth functions as well as variances of smooth functions of the population totals. 
Random group methods are easily used after weighting adjustments for nonresponse 
and undercoverage. 


Disadvantages: The number of random groups is often small, which gives imprecise
estimates of the variances (see Exercise 18). If $\theta$ is a nonlinear statistic, $\tilde{\theta}$ can have
large bias if the number of observations in each group is small. Generally one would
like at least ten random groups to obtain a more stable estimate of the variance and to
avoid inflating the CI by using a critical value from a $t$ distribution with few df. Setting
up the random groups can be difficult in complicated designs, as each random group
must have the same design structure as the complete survey. The survey design may
limit the number of random groups that can be constructed; if two psus are selected
in each stratum, then only two random groups can be formed.


9.3 Resampling and Replication Methods


Random group methods are easy to compute and explain but are unstable if a complex 
sample can only be split into a small number of groups. Resampling methods treat 
the sample as if it were itself a population; we take different samples from this new 
“population” and use the subsamples to estimate the variance. All of the methods in 
this section calculate variance estimates for a sample in which psus are sampled with 
replacement. If psus are sampled without replacement, these methods may still be 
used, but are expected to overestimate the variance and result in conservative CIs, as
discussed in Section 6.4.3. 


9.3.1 Balanced Repeated Replication (BRR)


Some surveys are stratified to the point that only two psus are selected from each
stratum. This gives the highest degree of stratification possible while still allowing
calculation of variance estimates in each stratum.

TABLE 9.2
A Small Stratified Random Sample, Used to Illustrate BRR

Stratum   $N_h/N$   $y_{h1}$   $y_{h2}$   $\bar{y}_h$   $y_{h1} - y_{h2}$
   1       0.30      2,000      1,792      1,896           208
   2       0.10      4,525      4,735      4,630          -210
   3       0.05      9,550     14,060     11,805        -4,510
   4       0.10        800      1,250      1,025          -450
   5       0.20      9,300      7,264      8,282         2,036
   6       0.05     13,286     12,840     13,063           446
   7       0.20      2,106      2,070      2,088            36


9.3.1.1 BRR in a Stratified Random Sample


We illustrate BRR for a problem we already know how to solve: calculating the
variance for $\bar{y}_{\text{str}}$ from a stratified random sample. More complex statistics from stratified
multistage samples are discussed in Section 9.3.1.2.

Suppose an SRS of two observation units is chosen from each of seven strata. We
arbitrarily label one of the sampled units in stratum $h$ as $y_{h1}$, and the other as $y_{h2}$. The
sampled values are given in Table 9.2.

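The estimated population mean is

\[
\bar{y}_{\text{str}} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_h = 4451.7.
\]

Ignoring the finite population corrections (fpcs) in (3.5) gives the variance estimator

\[
\hat{V}_{\text{str}}(\bar{y}_{\text{str}}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h};
\]

when $n_h = 2$, as here, $s_h^2 = (y_{h1} - y_{h2})^2/2$, so

\[
\hat{V}_{\text{str}}(\bar{y}_{\text{str}}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \frac{(y_{h1} - y_{h2})^2}{4}.
\]

Here, $\hat{V}_{\text{str}}(\bar{y}_{\text{str}}) = 55{,}892.75$. This may overestimate the variance if sampling is without
replacement.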

To use the random group method, we would randomly select one of the observations
in each stratum for group 1 and assign the other to group 2. The groups in this
situation are half-samples. For example, group 1 might consist of
$\{y_{11}, y_{22}, y_{32}, y_{42}, y_{51}, y_{62}, y_{71}\}$ and group 2 of the other seven observations. Then,

\[
\hat{\theta}_1 = (0.3)(2000) + (0.1)(4735) + \cdots + (0.2)(2106) = 4824.7
\]

and

\[
\hat{\theta}_2 = (0.3)(1792) + (0.1)(4525) + \cdots + (0.2)(2070) = 4078.7.
\]

The random group estimate of the variance (in this case, 139,129) has only 1 df for
a two-psu-per-stratum design and is unstable in practice. If a different assignment of
observations to groups had been made (had, for example, group 1 consisted of $y_{h1}$
for strata 2, 3, and 5 and $y_{h2}$ for strata 1, 4, 6, and 7), then $\hat{\theta}_1 = 4508.6$, $\hat{\theta}_2 = 4394.8$,
and the random group estimate of the variance would have been 3238.
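To see how strongly the two-group estimate depends on the assignment, one can enumerate all $2^7 = 128$ ways of splitting the data in Table 9.2 into complementary half-samples. The Python sketch below does this; the function names are illustrative, not from the text. A sign of +1 in stratum $h$ puts $y_{h1}$ in group 1 and a sign of -1 puts $y_{h2}$ in group 1.

```python
from itertools import product

# Table 9.2: stratum weights N_h/N and the two sampled values per stratum.
W = [0.30, 0.10, 0.05, 0.10, 0.20, 0.05, 0.20]
Y1 = [2000, 4525, 9550, 800, 9300, 13286, 2106]
Y2 = [1792, 4735, 14060, 1250, 7264, 12840, 2070]


def half_sample_mean(signs):
    """Estimate the mean from one half-sample; +1 picks y_h1, -1 picks y_h2."""
    return sum(w * (a if s == 1 else b) for w, a, b, s in zip(W, Y1, Y2, signs))


def two_group_variance(signs):
    """Random group variance with R = 2: group 1 from `signs`, group 2 its complement."""
    theta1 = half_sample_mean(signs)
    theta2 = half_sample_mean([-s for s in signs])
    theta_tilde = (theta1 + theta2) / 2
    # (1/R) * [1/(R-1)] * sum of squared deviations, with R = 2
    return ((theta1 - theta_tilde) ** 2 + (theta2 - theta_tilde) ** 2) / 2


# The two assignments discussed in the text.
first = two_group_variance([1, -1, -1, -1, 1, -1, 1])
second = two_group_variance([-1, 1, 1, -1, 1, -1, -1])

# Every possible assignment of one unit per stratum to group 1.
all_estimates = [two_group_variance(s) for s in product([1, -1], repeat=7)]
average = sum(all_estimates) / len(all_estimates)

print(round(first), round(second))
print(round(min(all_estimates)), round(max(all_estimates)), round(average, 2))
```

Individual assignments give widely different variance estimates, but their average over all 128 assignments equals the stratified variance estimate. Averaging over half-samples is the idea that balanced repeated replication exploits, using far fewer than $2^H$ replicates.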

McCarthy (1966, 1969) notes that altogether $2^H$ possible half-samples could be
formed, and suggests using a balanced sample of the $2^H$ possible half-samples to
estimate the variance. Balanced repeated replication uses the variability among $R$
replicate half-samples that are selected in a balanced way to estimate the variance
of $\hat{\theta}$.

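To define balance, let's introduce the following notation. Half-sample $r$ can be
defined by a vector $\alpha_r = (\alpha_{r1}, \ldots, \alpha_{rH})$: Let

\[
y_h(\alpha_r) =
\begin{cases}
y_{h1} & \text{if } \alpha_{rh} = 1 \\
y_{h2} & \text{if } \alpha_{rh} = -1.
\end{cases}
\]

Equivalently,

\[
y_h(\alpha_r) = \frac{\alpha_{rh} + 1}{2}\,y_{h1} - \frac{\alpha_{rh} - 1}{2}\,y_{h2}.
\]

If group 1 contains observations $\{y_{11}, y_{22}, y_{32}, y_{42}, y_{51}, y_{62}, y_{71}\}$ as above, then
$\alpha_1 = (1, -1, -1, -1, 1, -1, 1)$. Similarly, $\alpha_2 = (-1, 1, 1, 1, -1, 1, -1)$. The set of $R$
replicate half-samples is balanced if

\[
\sum_{r=1}^{R} \alpha_{rh}\,\alpha_{rl} = 0 \quad \text{for all } l \ne h.
\]

For replicate $r$, calculate $\hat{\theta}(\alpha_r)$ the same way as $\hat{\theta}$ but using only the observations
in the half-sample selected by $\alpha_r$. For estimating the mean of a stratified random
sample, $\hat{\theta}(\alpha_r) = \sum_{h=1}^{H} (N_h/N)\,y_h(\alpha_r)$. Define the BRR variance estimator to be

\[
\hat{V}_{\text{BRR}}(\hat{\theta}) = \frac{1}{R}\sum_{r=1}^{R} \left[\hat{\theta}(\alpha_r) - \hat{\theta}\right]^2.
\]

If the set of half-samples is balanced, then for stratified random sampling
$\hat{V}_{\text{BRR}}(\bar{y}_{\text{str}}) = \hat{V}_{\text{str}}(\bar{y}_{\text{str}})$. (The proof of this is left as Exercise 19.) If, in addition,
$\sum_{r=1}^{R} \alpha_{rh} = 0$ for $h = 1, \ldots, H$, then $\frac{1}{R}\sum_{r=1}^{R} \bar{y}_{\text{str}}(\alpha_r) = \bar{y}_{\text{str}}$.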

For our example, the set of $\alpha$'s in the following table meets the balancing
condition $\sum_{r=1}^{R} \alpha_{rh}\alpha_{rl} = 0$ for all $l \ne h$. The 8 x 7 matrix of -1's and 1's has orthogonal
columns; in fact, it is the design matrix (excluding the column of ones) for a fractional
factorial design (Box et al., 1978), called a Hadamard matrix. Wolter (1985) gives
more detail on constructing these matrices.

                              Stratum (h)
Half-sample (r)     1     2     3     4     5     6     7
  $\alpha_1$       -1    -1    -1     1     1     1    -1
  $\alpha_2$        1    -1    -1    -1    -1     1     1
  $\alpha_3$       -1     1    -1    -1     1    -1     1
  $\alpha_4$        1     1    -1     1    -1    -1    -1
  $\alpha_5$       -1    -1     1     1    -1    -1     1
  $\alpha_6$        1    -1     1    -1     1    -1    -1
  $\alpha_7$       -1     1     1    -1    -1     1    -1
  $\alpha_8$        1     1     1     1     1     1     1

The estimate from each half-sample, $\hat{\theta}(\alpha_r) = \bar{y}_{\text{str}}(\alpha_r)$, is calculated from the data in
Table 9.2.

Half-sample    $\hat{\theta}(\alpha_r)$    $[\hat{\theta}(\alpha_r) - \hat{\theta}]^2$
    1            4732.4         78,792.5
    2            4439.8            141.6
    3            4741.3         83,868.2
    4            4344.3         11,534.8
    5            4084.6        134,762.4
    6            4592.0         19,684.1
    7            4123.7        107,584.0
    8            4555.5         10,774.4
 average         4451.7         55,892.8


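The average of $[\hat{\theta}(\alpha_r) - \bar{y}_{\text{str}}]^2$ for the eight replicate half-samples is 55,892.75, which
is the same as $\hat{V}_{\text{str}}(\bar{y}_{\text{str}})$ for sampling with replacement. Note that we can calculate
the BRR variance estimate by creating a new variable of weights for each replicate
half-sample. The sampling weight for observation $i$ in stratum $h$ is $w_{hi} = N_h/n_h$, and

\[
\bar{y}_{\text{str}} = \frac{\sum_{h=1}^{H}\sum_{i=1}^{2} w_{hi}\,y_{hi}}{\sum_{h=1}^{H}\sum_{i=1}^{2} w_{hi}}.
\]

Define

\[
w_{hi}(\alpha_r) =
\begin{cases}
2w_{hi} & \text{if observation } i \text{ of stratum } h \text{ is in the half-sample selected by } \alpha_r \\
0 & \text{otherwise.}
\end{cases}
\]

Then

\[
\bar{y}_{\text{str}}(\alpha_r) = \frac{\sum_{h=1}^{H}\sum_{i=1}^{2} w_{hi}(\alpha_r)\,y_{hi}}{\sum_{h=1}^{H}\sum_{i=1}^{2} w_{hi}(\alpha_r)}.
\]

TABLE 9.3
Data Structure After Sorting

Observation   Stratum   psu      ssu      Weight,    Response     Response     Response
Number        Number    Number   Number   $w_i$      Variable 1   Variable 2   Variable 3
     1           1         1        1     $w_1$      $y_1$        $x_1$        $u_1$
     2           1         1        2     $w_2$      $y_2$        $x_2$        $u_2$
     3           1         1        3     $w_3$      $y_3$        $x_3$        $u_3$
     4           1         1        4     $w_4$      $y_4$        $x_4$        $u_4$
     5           1         2        1     $w_5$      $y_5$        $x_5$        $u_5$
     6           1         2        2     $w_6$      $y_6$        $x_6$        $u_6$
     7           1         2        3     $w_7$      $y_7$        $x_7$        $u_7$
     8           1         2        4     $w_8$      $y_8$        $x_8$        $u_8$
     9           1         2        5     $w_9$      $y_9$        $x_9$        $u_9$
    10           2         1        1     $w_{10}$   $y_{10}$     $x_{10}$     $u_{10}$
    11           2         1        2     $w_{11}$   $y_{11}$     $x_{11}$     $u_{11}$
   Etc.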


Similarly, for any statistic $\hat{\theta}$ calculated using the weights $w_{hi}$, $\hat{\theta}(\alpha_r)$ is calculated
exactly the same way, but using the new weights $w_{hi}(\alpha_r)$. Using the new weight
variables instead of selecting the subset of observations simplifies calculations for
surveys with many response variables: the same column $w(\alpha_r)$ can be used to find
the $r$th half-sample estimate for all quantities of interest. The modified weights also
make it easy to extend the method to stratified multistage samples. SAS software will
print the Hadamard matrix defining the half-samples and construct replicate weights;
code for analyzing the data in Table 9.2 using BRR is on the website.
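The replicate-weight formulation can be illustrated for the data in Table 9.2. The Python sketch below is not the SAS code from the website; it is a minimal illustration under two stated assumptions. First, the base weights are taken as $(N_h/N)/2$ rather than $N_h/2$, which leaves every weighted mean unchanged. Second, the second half-sample is taken to be $(1, -1, -1, -1, -1, 1, 1)$, the row that keeps the columns orthogonal and reproduces the tabulated estimate 4439.8.

```python
# Table 9.2: relative stratum sizes N_h/N and the two observations per stratum.
N_FRAC = [0.30, 0.10, 0.05, 0.10, 0.20, 0.05, 0.20]
Y = [(2000, 1792), (4525, 4735), (9550, 14060), (800, 1250),
     (9300, 7264), (13286, 12840), (2106, 2070)]

# Balanced half-samples alpha_1, ..., alpha_8 (one row per replicate).
ALPHA = [
    [-1, -1, -1,  1,  1,  1, -1],
    [ 1, -1, -1, -1, -1,  1,  1],
    [-1,  1, -1, -1,  1, -1,  1],
    [ 1,  1, -1,  1, -1, -1, -1],
    [-1, -1,  1,  1, -1, -1,  1],
    [ 1, -1,  1, -1,  1, -1, -1],
    [-1,  1,  1, -1, -1,  1, -1],
    [ 1,  1,  1,  1,  1,  1,  1],
]
H, R = len(N_FRAC), len(ALPHA)


def is_balanced(alpha):
    """Check that sum_r alpha_rh * alpha_rl = 0 for every pair of strata h != l."""
    n_cols = len(alpha[0])
    return all(
        sum(row[h] * row[l] for row in alpha) == 0
        for h in range(n_cols) for l in range(h + 1, n_cols)
    )


def weighted_mean(weights):
    """Weighted mean of Y using a 7 x 2 table of weights."""
    num = sum(weights[h][i] * Y[h][i] for h in range(H) for i in range(2))
    den = sum(weights[h][i] for h in range(H) for i in range(2))
    return num / den


def replicate_weights(alpha_r, base):
    """Double the weight of the selected unit in each stratum; zero out the other."""
    return [[2 * base[h][0], 0.0] if alpha_r[h] == 1 else [0.0, 2 * base[h][1]]
            for h in range(H)]


# Base weights proportional to N_h / n_h with n_h = 2 (assumption: N_h/N replaces N_h).
base_w = [[f / 2, f / 2] for f in N_FRAC]

theta_hat = weighted_mean(base_w)
replicates = [weighted_mean(replicate_weights(a, base_w)) for a in ALPHA]
v_brr = sum((t - theta_hat) ** 2 for t in replicates) / R

print(is_balanced(ALPHA), round(theta_hat, 1), round(v_brr, 2))
```

The same `replicate_weights` columns could be reused for any other statistic computed from the weights, which is the practical appeal of the approach.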


9.3.1.2 BRR in a Stratified Multistage Survey


When $\bar{y}_{\text{str}}$ is the only quantity of interest in a stratified random sample, BRR is simply a
fancy method of calculating the variance in (3.5) and adds little extra to the procedure
in Chapter 3. BRR's value in a complex survey comes from its ability to estimate the
variance of a general population quantity $\theta$, where $\theta$ may be a ratio of two variables,
a correlation coefficient, a quantile, or another quantity of interest.

Suppose the population has H strata, and two psus are selected from stratum h with 
unequal probabilities and with replacement. (In replication methods, we like sampling 
with replacement because the subsampling design does not affect the variance esti- 
mator, as we saw in Section 6.3.) The same method may be used when sampling is 
done without replacement in each stratum, but the estimated variance is expected to 
be larger than the without-replacement variance. The data file for a complex survey 
with two psus per stratum often resembles that shown in Table 9.3, after sorting by 
stratum and psu. 

The vector $\alpha_r$ defines the half-sample $r$: If $\alpha_{rh} = 1$, then all observation units in
psu 1 of stratum $h$ are in half-sample $r$. If $\alpha_{rh} = -1$, then all observation units in psu 2
of stratum $h$ are in half-sample $r$. The vectors $\alpha_r$ are selected in a balanced way,
exactly as in stratified random sampling. Now, for half-sample $r$, create a new column
of weights $w(\alpha_r)$:

\[
w_i(\alpha_r) =
\begin{cases}
2w_i & \text{if observation unit } i \text{ is in half-sample } r \\
0 & \text{otherwise.}
\end{cases}
\qquad (9.6)
\]

For the data structure in Table 9.3 with $\alpha_{r1} = -1$ and $\alpha_{r2} = 1$, the column $w(\alpha_r)$
will be

\[
(0, 0, 0, 0, 2w_5, 2w_6, 2w_7, 2w_8, 2w_9, 2w_{10}, 2w_{11}, \ldots).
\]


Now use the column $w(\alpha_r)$ instead of $w$ to estimate quantities for half-sample $r$.
The estimate of the population total of $y$ for the full sample is $\sum_{i \in S} w_i y_i$; the
estimate of the population total of $y$ for half-sample $r$ is $\sum_{i \in S} w_i(\alpha_r) y_i$. If $\theta = t_y/t_x$,
then $\hat{\theta} = \sum_{i \in S} w_i y_i / \sum_{i \in S} w_i x_i$, and
$\hat{\theta}(\alpha_r) = \sum_{i \in S} w_i(\alpha_r) y_i / \sum_{i \in S} w_i(\alpha_r) x_i$. We saw
in Section 7.3 that the empirical distribution function is calculated using the weights:

\[
\hat{F}(y) = \frac{\text{sum of } w_i \text{ for all observations with } y_i \le y}{\text{sum of } w_i \text{ for all observations}}.
\]

Then the empirical distribution using half-sample $r$ is

\[
\hat{F}_r(y) = \frac{\text{sum of } w_i(\alpha_r) \text{ for all observations with } y_i \le y}{\text{sum of } w_i(\alpha_r) \text{ for all observations}}.
\]

If $\theta$ is the population median, then $\hat{\theta}$ may be defined as the smallest value of $y$ for
which $\hat{F}(y) \ge 1/2$, and $\hat{\theta}(\alpha_r)$ is the smallest value of $y$ for which $\hat{F}_r(y) \ge 1/2$.
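For any quantity $\theta$, we define

\[
\hat{V}_{\text{BRR}}(\hat{\theta}) = \frac{1}{R}\sum_{r=1}^{R} \left[\hat{\theta}(\alpha_r) - \hat{\theta}\right]^2. \qquad (9.7)
\]

BRR can also be used to estimate covariances of statistics: If $\theta$ and $\eta$ are two quantities
of interest, then

\[
\widehat{\text{Cov}}_{\text{BRR}}(\hat{\theta}, \hat{\eta}) = \frac{1}{R}\sum_{r=1}^{R} \left[\hat{\theta}(\alpha_r) - \hat{\theta}\right]\left[\hat{\eta}(\alpha_r) - \hat{\eta}\right].
\]

Other BRR variance estimators, variations of (9.7), are described in Exercise 20.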
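Because (9.7) only requires recomputing the statistic with each column of replicate weights, the same few functions cover ratios, medians, and other statistics. The following Python sketch shows the idea on a small data set that is entirely hypothetical (two strata, two psus per stratum, two observation units per psu); all names and values are invented for illustration and do not come from any survey in the text.

```python
def weighted_median(y, w):
    """Smallest value of y at which the weighted empirical cdf reaches 1/2."""
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    cumulative = 0.0
    for value, weight in sorted(zip(y, w)):
        cumulative += weight
        if cumulative >= total / 2:
            return value


def weighted_ratio(y, x, w):
    """Ratio of the weighted total of y to the weighted total of x."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * xi for wi, xi in zip(w, x))


def half_sample_weights(stratum, psu, w, alpha_r):
    """Weights 2*w_i for units in the psu picked by alpha_r, 0 for the others."""
    picked = {h: (1 if a == 1 else 2) for h, a in alpha_r.items()}
    return [2 * wi if picked[h] == j else 0.0 for h, j, wi in zip(stratum, psu, w)]


def brr_variance(stat, w, replicate_columns):
    """Average squared deviation of replicate estimates from the full-sample estimate."""
    theta_hat = stat(w)
    squared = [(stat(col) - theta_hat) ** 2 for col in replicate_columns]
    return sum(squared) / len(replicate_columns)


# Hypothetical data: not from the text.
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu     = [1, 1, 2, 2, 1, 1, 2, 2]
y       = [3, 5, 4, 8, 10, 12, 9, 15]
x       = [1, 2, 2, 3, 4, 5, 3, 6]
w       = [10, 10, 12, 12, 8, 8, 9, 9]

# Four balanced half-samples for H = 2: the products alpha_r1 * alpha_r2 sum to zero.
alphas = [{1: 1, 2: 1}, {1: 1, 2: -1}, {1: -1, 2: 1}, {1: -1, 2: -1}]
columns = [half_sample_weights(stratum, psu, w, a) for a in alphas]

v_ratio = brr_variance(lambda wt: weighted_ratio(y, x, wt), w, columns)
v_median = brr_variance(lambda wt: weighted_median(y, wt), w, columns)
print(v_ratio, v_median)
```

Passing the statistic as a function of the weight column mirrors the way public-use files are analyzed: the data stay fixed and only the weights change from replicate to replicate.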

While the exact equivalence of $\hat{V}_{\text{BRR}}(\bar{y}_{\text{str}})$ and $\hat{V}_{\text{str}}(\bar{y}_{\text{str}})$ does not extend to
nonlinear statistics, Rao and Wu (1985) show that if $\theta$ is a smooth function of the
population totals, the variance estimator from BRR is asymptotically equivalent to
that from linearization. BRR also provides a consistent estimator of the variance for
quantiles when a stratified random sample is taken (Shao and Wu, 1992).

When a replication method such as BRR is used, data analysts can calculate
variances from data files without needing to know the stratification and clustering
information. The public-use data set can consist of the response variables, original
weights, and the columns of replicate weights. The statistic $\hat{\theta}$ is calculated by using the
original weights $w_i$ with the data vector of $y_i$'s. Then the columns of replicate weights
are used to perform the variance estimation: We calculate $\hat{\theta}(\alpha_r)$, for $r = 1, \ldots, R$, by
performing the same calculations used to find $\hat{\theta}$, with weights $w_i(\alpha_r)$ substituted for
the original weights $w_i$. Equation (9.7) is then applied to estimate the variance of $\hat{\theta}$.
Weighting adjustments for nonresponse, such as those discussed in Section 8.5, can be
incorporated into the replicate weights so that the BRR estimate of variance includes
the effects of the nonresponse and calibration adjustments (Canty and Davison, 1999).


EXAMPLE 9.5 Let's use BRR to estimate variances from the NHANES data studied in Section 7.4.2.
The public-use data set includes variables for pseudo-stratum and pseudo-psu that can
be used for variance estimation. (The original strata and psu variables are not released
to the public to preserve confidentiality of the respondents' data.) Each pseudo-stratum
has two pseudo-psus, so BRR can be used. We generate replicate weights for these
data using SAS software. The replicate weights can then be used to calculate standard
errors for any statistic, not just those in the software package. For example, some
software packages do not yet calculate medians from survey data; the original weight
variable can be used to estimate a median, and the replicate weights can then be used
to estimate the variance of the estimated median.

Since our replicate weights are based on the final weight variable included in the
NHANES data, however, they do not incorporate effects of nonresponse adjustment
on the variance. Many data sets that are made available to the public have replicate
weights that account for the nonresponse adjustments, and those are preferred if
available.

The data set has 15 pseudo-strata, so we use a 16 x 16 Hadamard matrix (16 is the
first multiple of 4 after 15 for which a Hadamard matrix exists). The replicate weights
for a few of the observations are given in Table 9.4. Each entry in the replicate weight
columns is either 0 or $2w_i$. Note that the pattern of replicate weights is the same for
each of the three observations from pseudo-psu 2 of pseudo-stratum 33.

TABLE 9.4
NHANES Data with Replicate Weights

                                       Replicate    Replicate    Replicate          Replicate    Replicate
Stratum   psu    BMI      Weight       Weight 1     Weight 2     Weight 3    ...    Weight 15    Weight 16
   39      2    50.85     5,824.78    11,649.56         0            0       ...    11,649.56        0
   41      1    20.78     5,564.04    11,128.08         0        11,128.08   ...    11,128.08        0
   33      2    19.60    12,947.34    25,894.68         0        25,894.68   ...        0        25,894.68
   37      1    21.64     7,304.95    14,609.89         0        14,609.89   ...        0        14,609.89
   33      1    15.76     8,385.25        0         16,770.49        0       ...    16,770.49        0
   33      2    28.32    19,994.16    39,988.32         0        39,988.32   ...        0        39,988.32
   41      1    38.03    15,876.72    31,753.44         0        31,753.44   ...    31,753.44        0
   33      2    26.76    40,061.77    80,123.54         0        80,123.54   ...        0        80,123.54

Using the replicate weights, and the SAS code given on the website, we estimate
the mean and the median first using the original weight vector and then using each of
the 16 vectors of replicate weights. Using the original weight vector, we estimate the
mean body mass index (variable bmxbmi) as $\bar{y} = 26.19$ and the median as $\hat{m} = 25.60$.
The values calculated using the replicate weights are:

                                    Replicate Weight
             1         2         3         4         5         6         7         8
Mean      26.3364   25.9977   26.0700   26.0741   26.4049   26.2483   25.9767   26.3310
Median    25.64     25.31     25.49     25.39     25.67     25.56     25.4      25.76

             9        10        11        12        13        14        15        16
Mean      26.3255   25.9909   26.1574   26.1584   26.1406   26.0324   26.2851   26.4078
Median    25.83     25.38     25.53     25.50     25.62     25.56     25.68     25.88

Using (9.7) we estimate $\hat{V}(\bar{y}) = 0.0215$ and $\hat{V}(\hat{m}) = 0.026$. The estimates from the
replicate weights tend to be close to each other; to avoid roundoff error, it is best to
do these calculations on the computer. ■
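As a check on this example, the short Python sketch below applies (9.7) to the 16 replicate means and medians listed in the table. It is an illustration only, not the SAS code from the website. Because the full-sample mean is entered as the rounded value 26.19, the result for the mean agrees with the reported 0.0215 only to about three decimal places, which illustrates the remark about roundoff.

```python
# Full-sample estimates and replicate estimates, copied from the example.
full_mean, full_median = 26.19, 25.60

rep_means = [26.3364, 25.9977, 26.0700, 26.0741, 26.4049, 26.2483, 25.9767, 26.3310,
             26.3255, 25.9909, 26.1574, 26.1584, 26.1406, 26.0324, 26.2851, 26.4078]
rep_medians = [25.64, 25.31, 25.49, 25.39, 25.67, 25.56, 25.4, 25.76,
               25.83, 25.38, 25.53, 25.50, 25.62, 25.56, 25.68, 25.88]


def brr_variance(replicates, full_estimate):
    """Equation (9.7): mean squared deviation of the replicate estimates."""
    return sum((r - full_estimate) ** 2 for r in replicates) / len(replicates)


v_mean = brr_variance(rep_means, full_mean)
v_median = brr_variance(rep_medians, full_median)

print(f"V(mean)   = {v_mean:.4f}, SE = {v_mean ** 0.5:.3f}")
print(f"V(median) = {v_median:.4f}, SE = {v_median ** 0.5:.3f}")
```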


Advantages: BRR gives a variance estimator that is asymptotically equivalent to that 
from linearization methods for smooth functions of population totals. It can also be 
used for estimating variances of quantiles. The data analyst only needs the columns 
of replicate weights, and does not need the original sampling design information, 
to calculate variances. It requires relatively few computations (and relatively few 
columns of replicate weights) when compared with the jackknife and the bootstrap. 


Disadvantages: As defined above, BRR can only be used in situations in which there 
are two psus per stratum. In practice, though, it is often extended to other sampling 
designs by using more complicated balancing schemes (see Fay, 1989 and Judkins, 
1990). BRR, like the jackknife and bootstrap, estimates the with-replacement vari- 
ance, and may overestimate the without-replacement variance. 


9.3.2 Jackknife


The jackknife method, like BRR, extends the random group method by allowing the 
replicate groups to overlap. The jackknife was introduced by Quenouille (1956) as a 
method of reducing bias; Tukey (1958) proposed using it to estimate variances and 
calculate CIs. In this section, we describe the delete-1 jackknife; Shao and Tu (1995) 
discuss other forms of the jackknife and give theoretical results. 

For an SRS, let $\hat{\theta}_{(j)}$ be the estimator of the same form as $\hat{\theta}$, but not using observation
$j$. Thus, if $\hat{\theta} = \bar{y}$, then $\hat{\theta}_{(j)} = \bar{y}_{(j)} = \sum_{i \ne j} y_i/(n - 1)$. For an SRS, define the delete-1
jackknife estimator (so called because we delete one observation in each replicate) as

\[
\hat{V}_{\text{JK}}(\hat{\theta}) = \frac{n-1}{n}\sum_{j=1}^{n} \left(\hat{\theta}_{(j)} - \hat{\theta}\right)^2. \qquad (9.8)
\]

Why the multiplier $(n - 1)/n$? Let's look at $\hat{V}_{\text{JK}}(\hat{\theta})$ when $\hat{\theta} = \bar{y}$. When $\hat{\theta} = \bar{y}$,

\[
\bar{y}_{(j)} = \frac{1}{n-1}\sum_{i \ne j} y_i = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i - y_j\right) = \bar{y} - \frac{1}{n-1}(y_j - \bar{y}).
\]

Then,

\[
\sum_{j=1}^{n} \left(\bar{y}_{(j)} - \bar{y}\right)^2 = \frac{1}{(n-1)^2}\sum_{j=1}^{n} (y_j - \bar{y})^2 = \frac{1}{n-1}\,s_y^2,
\]

so $\hat{V}_{\text{JK}}(\bar{y}) = s_y^2/n$, the with-replacement estimator of the variance of $\bar{y}$.


EXAMPLE 9.6 Let's use the jackknife to estimate the ratio of nonresident tuition to resident tuition
for the first group of colleges in Example 9.3. Here, $\hat{B} = \bar{y}/\bar{x}$, $\hat{B}_{(j)} = \bar{y}_{(j)}/\bar{x}_{(j)}$, and

\[
\hat{V}_{\text{JK}}(\hat{B}) = \frac{n-1}{n}\sum_{j=1}^{n} \left(\hat{B}_{(j)} - \hat{B}\right)^2.
\]

For each jackknife group in Table 9.5, omit one observation. Thus, $\bar{x}_{(1)}$ is the
average of all $x$'s except for $x_1$: $\bar{x}_{(1)} = (1/9)\sum_{i=2}^{10} x_i$. Here, $\hat{B} = 2.3288$,
$\sum_{j} (\hat{B}_{(j)} - \hat{B})^2 = 0.1043$, and $\hat{V}_{\text{JK}}(\hat{B}) = 0.09377$. ■

TABLE 9.5
Jackknife Calculations for Example 9.6

  j      x       y      $\bar{x}_{(j)}$   $\bar{y}_{(j)}$   $\hat{B}_{(j)}$
  1    1365    3747       1580.6            3617.7           2.2889
  2    1677    4983       1545.9            3480.3           2.2513
  3    1500    1500       1565.6            3867.3           2.4703
  4    1080    2160       1612.2            3794.0           2.3533
  5    1875    2475       1523.9            3759.0           2.4667
  6    3071    5135       1391.0            3463.4           2.4899
  7    1542    3950       1560.9            3595.1           2.3032
  8     930    4050       1628.9            3584.0           2.2003
  9    1340    4140       1583.3            3574.0           2.2573
 10    1210    4166       1597.8            3571.1           2.2350
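The delete-1 calculations in Table 9.5 are simple to reproduce. The Python sketch below, which uses illustrative names and is not taken from the website code, recomputes each $\hat{B}_{(j)}$ from the $x$ and $y$ columns and applies (9.8). Since it works from unrounded intermediate values, the final digits may differ slightly from the figures quoted in the example.

```python
# x (resident tuition) and y (nonresident tuition) columns of Table 9.5.
x = [1365, 1677, 1500, 1080, 1875, 3071, 1542, 930, 1340, 1210]
y = [3747, 4983, 1500, 2160, 2475, 5135, 3950, 4050, 4140, 4166]


def ratio(y_values, x_values):
    """Ratio estimator B = ybar / xbar, which equals sum(y) / sum(x)."""
    return sum(y_values) / sum(x_values)


def jackknife_variance(y_values, x_values):
    """Delete-1 jackknife variance (9.8) for the ratio estimator in an SRS."""
    n = len(y_values)
    b_hat = ratio(y_values, x_values)
    # Leave out observation j and recompute the ratio from the remaining n - 1.
    b_delete = [
        ratio(y_values[:j] + y_values[j + 1:], x_values[:j] + x_values[j + 1:])
        for j in range(n)
    ]
    sum_sq = sum((b_j - b_hat) ** 2 for b_j in b_delete)
    return b_hat, b_delete, (n - 1) / n * sum_sq


b_hat, b_delete, v_jk = jackknife_variance(y, x)

print(f"B = {b_hat:.4f}")
print("B_(j):", [round(b, 4) for b in b_delete])
print(f"V_JK = {v_jk:.4f}, SE = {v_jk ** 0.5:.3f}")
```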


How can we extend this to a cluster sample? One might think that you could just 
delete one observation unit at a time, but that will not work—deleting one observa- 
tion unit at a time destroys the cluster structure and gives an estimate of the variance 
that is only correct if the intraclass correlation coefficient is zero. In any resam- 
pling method and in the random group method, keep observation units within a psu 
together while constructing the replicates—this preserves the dependence among 
observation units within the same psu. For a cluster sample, then, we would apply the
jackknife variance estimator in (9.8) by letting $n$ be the number of psus, and letting $\hat{\theta}_{(j)}$
be the estimate of $\theta$ that we would obtain by deleting all the observations in psu $j$.

In a stratified multistage cluster sample, the jackknife is applied separately in each
stratum at the first stage of sampling, with one psu deleted at a time. Suppose there
are $H$ strata, and $n_h$ psus are chosen for the sample from stratum $h$. Assume these
psus are chosen with replacement.

To apply the jackknife, delete one psu at a time. Let $\hat{\theta}_{(hj)}$ be the estimator of the
same form as $\hat{\theta}$ when psu $j$ of stratum $h$ is omitted. To calculate $\hat{\theta}_{(hj)}$, define a new
weight variable: Let

\[
w_{i(hj)} =
\begin{cases}
w_i & \text{if observation unit } i \text{ is not in stratum } h \\
0 & \text{if observation unit } i \text{ is in psu } j \text{ of stratum } h \\
\dfrac{n_h}{n_h - 1}\,w_i & \text{if observation unit } i \text{ is in stratum } h \text{ but not in psu } j.
\end{cases}
\]

Then use the weights $w_{i(hj)}$ to calculate $\hat{\theta}_{(hj)}$, and

\[
\hat{V}_{\text{JK}}(\hat{\theta}) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{j=1}^{n_h} \left(\hat{\theta}_{(hj)} - \hat{\theta}\right)^2. \qquad (9.9)
\]


EXAMPLE 9.7 Here we use the jackknife to calculate the variance of the mean egg volume from
Example 5.7. We calculated $\hat{\theta} = \hat{\bar{y}}_r = 4375.947/1757 = 2.49$. In that example, since
we did not know the number of clutches in the population, we calculated the
with-replacement variance.

First, we find the weight vector for each of the 184 jackknife iterations. We have
only one stratum, so $h = 1$ for all observations. For $\hat{\theta}_{(1,1)}$, delete the first psu. Thus the
new weights for the observations in the first psu are 0; the weights in all remaining
psus are the previous weights times $n_h/(n_h - 1) = 184/183$. Using the weights from
Example 5.7, the new jackknife weight columns are shown in Table 9.6.

TABLE 9.6
Jackknife Weights for Example 5.7. The values $w_i$ are the relative weights; $w_{i(k)}$ is the set of
jackknife weights for the replication omitting psu $k$.

clutch   csize   $w_i$    $w_{i(1)}$    $w_{i(2)}$    ...   $w_{i(184)}$
   1      13      6.5      0             6.535519     ...    6.535519
   1      13      6.5      0             6.535519     ...    6.535519
   2      13      6.5      6.535519      0            ...    6.535519
   2      13      6.5      6.535519      0            ...    6.535519
   3       6      3        3.016393      3.016393     ...    3.016393
   3       6      3        3.016393      3.016393     ...    3.016393
   4      11      5.5      5.530055      5.530055     ...    5.530055
   4      11      5.5      5.530055      5.530055     ...    5.530055
  ...
 183      13      6.5      6.535519      6.535519     ...    6.535519
 183      13      6.5      6.535519      6.535519     ...    6.535519
 184      12      6        6.032787      6.032787     ...    0
 184      12      6        6.032787      6.032787     ...    0
 Sum    3514   1757     1753.53       1753.53        ...   1754.54

Note that the sums of the jackknife weights vary from column to column because
the original sample is not self-weighting. We calculated $\hat{\theta}$ as $(\sum w_i y_i)/\sum w_i$;
to find $\hat{\theta}_{(hj)}$, we follow the same procedure but use $w_{i(hj)}$ in place of $w_i$. Thus,
$\hat{\theta}_{(1,1)} = 4349.348/1753.53 = 2.48034$; $\hat{\theta}_{(1,2)} = 4345.036/1753.53 = 2.47788$;
$\hat{\theta}_{(1,184)} = 4357.819/1754.54 = 2.48374$. Using (9.9), then, we calculate $\hat{V}_{\text{JK}}(\hat{\theta}) = 0.00373$. This
results in a standard error of 0.061, the same as calculated in Example 5.7. SAS code
for constructing jackknife weights and finding the jackknife estimate of the variance
is given on the website. ■


EXAMPLE 9.8 We used the random group method to estimate the variance of mean age of residents
for the Survey of Youth in Custody in Example 9.4. The jackknife can also be used
to estimate the variance. SAS code on the website results in the following output:


                Data Summary

Number of Strata                   16
Number of Clusters                861
Number of Observations           2621
Sum of Weights                  25012

             Variance Estimation

Method                      Jackknife
Number of Replicates              861

                 Statistics

                         Std Error
Variable      Mean         of Mean       95% CL for Mean
age         16.639293     0.130106    16.3839236   16.8946626


The standard error from the jackknife method (0.130) differs from that given by the random
group method ($\sqrt{0.025} = 0.158$); the random group standard error is based on only 7 groups
and is less stable than the jackknife standard error. ■
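The weight construction behind (9.9) can be written generically. The following Python sketch is an illustration under stated assumptions, not the SAS code from the website: the data are held in parallel lists of stratum labels, psu labels, and weights, and the statistic is supplied as a function of a weight column. For a check with a known answer, it is applied to the stratified sample of Table 9.2, where each stratum has $n_h = 2$ and the jackknife variance of the mean should agree with the value 55,892.75 obtained earlier.

```python
def jackknife_columns(stratum, psu, w):
    """Yield (n_h, weight column) for each deleted psu j in each stratum h."""
    psus = {}
    for h, j in zip(stratum, psu):
        psus.setdefault(h, set()).add(j)
    for h, ids in psus.items():
        n_h = len(ids)
        if n_h < 2:
            raise ValueError("each stratum needs at least two psus")
        for j in sorted(ids):
            column = []
            for h_i, j_i, w_i in zip(stratum, psu, w):
                if h_i != h:
                    column.append(w_i)                    # other strata: unchanged
                elif j_i == j:
                    column.append(0.0)                    # deleted psu
                else:
                    column.append(w_i * n_h / (n_h - 1))  # rest of stratum h
            yield n_h, column


def jackknife_variance(stat, stratum, psu, w):
    """Stratified delete-one-psu jackknife variance, equation (9.9)."""
    theta_hat = stat(w)
    return sum((n_h - 1) / n_h * (stat(col) - theta_hat) ** 2
               for n_h, col in jackknife_columns(stratum, psu, w))


# Table 9.2 laid out one row per observation; weights are (N_h/N)/2.
N_FRAC = [0.30, 0.10, 0.05, 0.10, 0.20, 0.05, 0.20]
stratum = [h for h in range(1, 8) for _ in range(2)]
psu = [1, 2] * 7
y = [2000, 1792, 4525, 4735, 9550, 14060, 800, 1250,
     9300, 7264, 13286, 12840, 2106, 2070]
w = [f / 2 for f in N_FRAC for _ in range(2)]


def weighted_mean(weights):
    return sum(wi * yi for wi, yi in zip(weights, y)) / sum(weights)


v_jk = jackknife_variance(weighted_mean, stratum, psu, w)
print(round(v_jk, 2))

# Rescaling factor used in Example 9.7: 6.5 * 184/183 reproduces 6.535519.
print(round(6.5 * 184 / 183, 6))
```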


Advantages: The jackknife is an all-purpose method. The same procedure is used to 
estimate the variance for every statistic for which jackknife can be used. The jackknife 
works in stratified multistage samples in which BRR does not apply because more than 
two psus are sampled in each stratum. The jackknife provides a consistent estimator 
of the variance when @ is a smooth function of population totals (Krewski and Rao, 
1981). Replication methods such as the jackknife can be used to account for some of 
the effects of imputation on the variance estimates (Rao and Shao, 1992). 


Disadvantages: For some sampling designs, the jackknife may require a large amount
of computation. The jackknife performs poorly for estimating the variances of some
statistics that are not smooth functions of population totals. For example, the jackknife
does not give a consistent estimator of the variance of quantiles in an SRS.


9.3.3 Bootstrap


As with the jackknife, theoretical results for the bootstrap were first developed for
areas of statistics other than survey sampling; Shao and Tu (1995) summarize theoretical
results for the bootstrap in complex survey samples. We first describe the bootstrap
for an SRS with replacement, as developed by Efron (1979, 1982) and described in
Davison and Hinkley (1997). Suppose $S$ is an SRS with replacement of size $n$. We
hope, in drawing the sample, that it reproduces properties of the whole population.
We then treat the sample $S$ as if it were a population, and take resamples from $S$. If
the sample really is similar to the population (that is, if the empirical probability mass
function of the sample is similar to the probability mass function of the population), then
samples generated from the empirical probability mass function should behave like
samples taken from the population.


EXAMPLE 9.9 Let's use the bootstrap to estimate the variance of the median height, $\theta$, in the height
population from Example 7.3, using the sample in the file ht.srs. The population
median height is $\theta = 168$; the sample median from ht.srs is $\hat{\theta} = 169$. Figure 7.2, the
probability mass function for the population, and Figure 7.3, the histogram of the
sample, are similar in shape (largely because the sample size for the SRS is large), so
we would expect that taking an SRS of size $n$ with replacement from $S$ would be like
taking an SRS with replacement from the population. A resample from $S$, though,
will not be exactly the same as $S$ because the resample is drawn with replacement: some
observations in $S$ may occur twice or more in the resample, while other observations
in $S$ may not occur at all.

We take an SRS of size 200 with replacement from $S$ to form the first resample.
The first resample from $S$ has an empirical probability mass function similar to but
not identical to that of $S$; the resample median is $\hat{\theta}^*_1 = 170$. Repeating the process, the
second resample from $S$ has median $\hat{\theta}^*_2 = 169$. We take a total of $R = 2000$ resamples
from $S$ and calculate the sample median from each sample, obtaining $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_R$.
We obtain the following frequency table for the 2000 resample medians:

Median of resample   165.0  166.0  166.5  167.0  167.5  168.0  168.5  169.0  169.5  170.0  170.5  171.0  171.5  172.0
Frequency                1      5      2     40     15    268     87    739    111    491     44    188      5      4

The sample mean of these 2000 values is 169.3 and the sample variance of these
2000 values is 0.9148; this is the bootstrap estimate of the variance of the sample
median. An approximate 95% CI may be constructed using the bootstrap variance as
$169.3 \pm 1.96\sqrt{0.9148} = [167.4, 171.2]$. Alternatively, the bootstrap distribution may
be used to calculate a CI directly. The bootstrap distribution estimates the sampling
distribution of $\hat{\theta}$, so a 95% percentile CI may be calculated by finding the 2.5 percentile
and the 97.5 percentile of the bootstrap distribution. For this example, a 95% percentile
CI for the median is [167.5, 171]. Manly (1997) describes other methods for finding
CIs using the bootstrap. ■
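The summary figures in this example can be recovered from the frequency table of resample medians. The Python sketch below does so, taking the count for 165.0 to be 1 so that the counts total 2000. It also includes a small resampling function, `bootstrap_medians`, whose name, seed argument, and use of the standard library's `random` and `statistics` modules are choices made here for illustration; the file ht.srs is not read. The percentile interval uses one simple convention (order statistics at positions 0.025R and 0.975R); other percentile definitions may give slightly different endpoints.

```python
import random
import statistics

# Frequency table of the 2000 resample medians from the example.
MEDIAN_COUNTS = {
    165.0: 1, 166.0: 5, 166.5: 2, 167.0: 40, 167.5: 15, 168.0: 268, 168.5: 87,
    169.0: 739, 169.5: 111, 170.0: 491, 170.5: 44, 171.0: 188, 171.5: 5, 172.0: 4,
}


def summarize(counts):
    """Mean, sample variance, and 95% percentile interval of tabulated replicates."""
    values = sorted(v for v, c in counts.items() for _ in range(c))
    R = len(values)
    mean = sum(values) / R
    variance = sum((v - mean) ** 2 for v in values) / (R - 1)
    lower = values[R * 25 // 1000 - 1]    # order statistic at position 0.025 * R
    upper = values[R * 975 // 1000 - 1]   # order statistic at position 0.975 * R
    return R, mean, variance, (lower, upper)


def bootstrap_medians(sample, R, seed=0):
    """Medians of R with-replacement resamples of the same size as `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.median(rng.choices(sample, k=n)) for _ in range(R)]


R, boot_mean, boot_var, percentile_ci = summarize(MEDIAN_COUNTS)
half_width = 1.96 * boot_var ** 0.5

print(R, round(boot_mean, 1), round(boot_var, 4))
print([round(169.3 - half_width, 1), round(169.3 + half_width, 1)], percentile_ci)
```

With a data set in hand, `bootstrap_medians(heights, 2000)` would generate the list of resample medians that the frequency table summarizes, where `heights` is a hypothetical list holding the sample values.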
If the original SRS is without replacement, Gross (1980) proposes creating $N/n$
copies of the sample to form a "pseudopopulation," then drawing $R$ SRSs without
replacement from the pseudopopulation. If $n/N$ is small, the with-replacement and
without-replacement bootstrap distributions should be similar.

Shao (2003) describes methods for using the bootstrap with data from a complex
survey. In all of the methods, we take bootstrap resamples of the psus within each
stratum. As with BRR and the jackknife, observations within a psu are always kept
together in the bootstrap iterations.

Here are steps for using the rescaling bootstrap of Rao and Wu (1988) for a
stratified multistage sample. Let $n_h$ be the number of psus sampled from stratum $h$.
Let $R$ be the number of bootstrap replicates to be created. Typically, $R = 500$ or 1,000,
although some statisticians use smaller values of $R$.


1 For bootstrap replicate $r$ ($r = 1, \ldots, R$), select an SRS of $n_h - 1$ psus with
replacement from the $n_h$ sample psus in stratum $h$. Do this independently for each stratum.
Let $m_{hj}(r)$ be the number of times psu $j$ of stratum $h$ is selected in replicate $r$.


2 Create the replicate weight vector for replicate $r$ as

\[
w_i(r) = w_i \times \frac{n_h}{n_h - 1}\,m_{hj}(r), \quad \text{for observation } i \text{ in psu } j \text{ of stratum } h.
\]

The result is $R$ vectors of replicate weights.

3 Use the vectors of replicate weights to estimate $V(\hat{\theta})$. Let $\hat{\theta}^*_r$ be the estimator of
$\theta$, calculated the same way as $\hat{\theta}$ but using weights $w_i(r)$ instead of the original
weights $w_i$. Then,

\[
\hat{V}_{\text{B}}(\hat{\theta}) = \frac{1}{R-1}\sum_{r=1}^{R} \left(\hat{\theta}^*_r - \hat{\theta}\right)^2.
\]
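The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS macro from the website: the data layout (parallel lists of stratum labels, psu labels, and weights), the function names, the seed argument, and the toy data are all assumptions made here. The variance function divides by $R - 1$.

```python
import random


def rescaling_bootstrap_weights(stratum, psu, w, R, seed=0):
    """Return R replicate weight columns for the rescaling bootstrap."""
    rng = random.Random(seed)
    psus = {}
    for h, j in zip(stratum, psu):
        ids = psus.setdefault(h, [])
        if j not in ids:
            ids.append(j)
    if any(len(ids) < 2 for ids in psus.values()):
        raise ValueError("each stratum needs at least two psus")

    columns = []
    for _ in range(R):
        # Step 1: draw n_h - 1 psus with replacement in each stratum and count
        # how often each psu is selected (m_hj).
        counts = {}
        for h, ids in psus.items():
            draws = rng.choices(ids, k=len(ids) - 1)
            for j in ids:
                counts[(h, j)] = draws.count(j)
        # Step 2: replicate weight = w_i * n_h / (n_h - 1) * m_hj.
        column = [
            w_i * len(psus[h]) / (len(psus[h]) - 1) * counts[(h, j)]
            for h, j, w_i in zip(stratum, psu, w)
        ]
        columns.append(column)
    return columns


def bootstrap_variance(stat, w, columns):
    """Step 3: squared deviations of replicate estimates, divided by R - 1."""
    theta_hat = stat(w)
    return sum((stat(col) - theta_hat) ** 2 for col in columns) / (len(columns) - 1)


# Hypothetical toy data: one observation per psu, 3 psus in stratum 1, 4 in stratum 2.
stratum = [1, 1, 1, 2, 2, 2, 2]
psu = [1, 2, 3, 1, 2, 3, 4]
y = [12, 15, 11, 30, 27, 35, 31]
w = [2, 2, 2, 5, 5, 5, 5]


def weighted_mean(weights):
    return sum(wi * yi for wi, yi in zip(weights, y)) / sum(weights)


columns = rescaling_bootstrap_weights(stratum, psu, w, R=500, seed=1)
v_boot = bootstrap_variance(weighted_mean, w, columns)
print(round(v_boot, 3))
```

In this toy layout every psu in a stratum carries the same weight, so each replicate column sums to the original total of 26. That is a convenient sanity check on the rescaling factor $n_h/(n_h - 1)$.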


EXAMPLE 9.10 We use the bootstrap to estimate variances from the data in file htstrat.dat, discussed
in Example 7.3. The bootstrap weights are constructed by taking 1000 stratified random
samples with replacement from the data set; we select 159 women and 39 men
with replacement in each resample. The average height is estimated by $\bar{y}_{\text{str}} = 169.02$
with bootstrap standard error 0.737; the standard error calculated using the stratified
sampling formula in (3.5), ignoring the fpc, is 0.739. The SAS macro on the website
uses SAS PROC SURVEYSELECT to construct the replicate bootstrap weights. ■


EXAMPLE 9.11 We noted in Section 8.6 that if a data set has imputed values and then is analyzed as 
if those imputed values were real, the resulting variance estimate is too low. Repli- 
cation methods such as bootstrap can be used to account for some of the effects of 
imputation on the variance estimates. Zhang et al. (1998) use the bootstrap with the 
1993-94 Schools and Staffing Survey to estimate the amount that imputation inflated 
the variance estimates. The survey uses several types of imputation, including hot- 
deck imputation. They found that the standard errors calculated accounting for the 
imputation could be up to twice as large as if the imputation were ignored. ■


Advantages: The bootstrap will work for smooth functions of population means and
for some nonsmooth functions such as quantiles in general sampling designs. The
bootstrap is well suited for finding CIs directly: To calculate a 90% CI, one can
merely take the 5th and 95th percentiles from $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_R$, or can use a bootstrap-t
method such as that described in Efron (1982).


Disadvantages: In some settings, the bootstrap may require more computations than 
BRR or jackknife, since R is typically a very large number. In other large surveys, 
however, for example if a stratified random sample is taken, the bootstrap may require 
fewer computations than the jackknife. The bootstrap variance estimate differs when 
a different set of bootstrap samples is taken. 


9.4 Generalized Variance Functions


In many large government surveys such as the U.S. Current Population Survey (CPS) 
or the Canadian Labour Force Survey, hundreds or thousands of estimates are calcu- 
lated and published. The agencies analyzing the survey results could calculate stan- 
dard errors for each published estimate and publish additional tables of the standard 
errors, but that would add greatly to the labor involved in publishing timely estimates 
from the surveys. In addition, other analysts of the public-use data files may wish 
to calculate additional estimates, and the public-use files may not provide enough 
information to allow calculation of standard errors. 

Generalized variance functions (GVFs) are provided in a number of surveys to 
calculate standard errors. They have been used for the CPS since 1947. Wolter (2007, 
Chapter 7) describes the theory underlying GVFs. 

Criminal Victimization in the United States, 1990 (U.S. Department of Justice,
p. 146), gives GVF formulas for calculating standard errors in the 1990 National Crime
Victimization Survey (NCVS). If $\hat{t}$ is an estimated number of persons or households
victimized by a particular type of crime, or if $\hat{t}$ estimates a total number of victimization
incidents,

\[
\hat{V}(\hat{t}) = a\hat{t}^2 + b\hat{t}. \qquad (9.10)
\]

If $\hat{p}$ is an estimated proportion,

\[
\hat{V}(\hat{p}) = \frac{b}{\hat{z}}\,\hat{p}(1 - \hat{p}), \qquad (9.11)
\]

where $\hat{z}$ is the estimated base population for the proportion. For the 1990 NCVS, the
values of $a$ and $b$ were $a = -0.00001833$ and $b = 3725$. For example, for 1990 it was
estimated that 1.23% of persons aged 20 to 24 were robbed, and it was also estimated
that there were 18,017,100 persons in that age group. Thus the GVF estimate of
SE($\hat{p}$) is

\[
\sqrt{\frac{3725}{18{,}017{,}100}\,(0.0123)(1 - 0.0123)} = 0.0016.
\]

Assuming that asymptotic results apply, this gives an approximate 95% CI of
$0.0123 \pm (1.96)(0.0016)$, or [0.0091, 0.0153]. There were an estimated 800,510 completed
robberies in 1990. Using (9.10), the standard error of this estimate is

\[
\sqrt{(-0.00001833)(800{,}510)^2 + 3725(800{,}510)} = 54{,}499.
\]
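Both GVF calculations are one-line formulas, so they are easy to wrap in small functions. The Python sketch below uses the 1990 NCVS constants quoted above; the function names are illustrative and do not come from any published software.

```python
import math

# GVF parameters for the 1990 NCVS, as given in the text.
A = -0.00001833
B = 3725.0


def gvf_se_total(t_hat, a=A, b=B):
    """Standard error of an estimated total from (9.10): sqrt(a*t^2 + b*t)."""
    return math.sqrt(a * t_hat ** 2 + b * t_hat)


def gvf_se_proportion(p_hat, base, b=B):
    """Standard error of an estimated proportion from (9.11): sqrt((b/base)*p*(1-p))."""
    return math.sqrt(b / base * p_hat * (1 - p_hat))


# Robbery rate among persons aged 20 to 24, with base population 18,017,100.
se_p = gvf_se_proportion(0.0123, 18_017_100)

# Estimated number of completed robberies.
se_t = gvf_se_total(800_510)

print(f"SE(p) = {se_p:.4f}")
print(f"SE(t) = {se_t:,.0f}")
```

Note that (9.10) is only usable while $a\hat{t}^2 + b\hat{t}$ is positive; with a negative $a$, the expression under the square root turns negative for totals larger than $-b/a$, roughly 203 million for these constants.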


Where do these formulas come from? Suppose $t_i$ is the total number of observation 
units belonging to a class, say the total number of persons in the United States who 
were victims of violent crime in 1990. Let $p_i = t_i/N$, the proportion of persons in the 
population belonging to that class. If $d_i$ is the design effect (deff) in the survey for 
estimating $p_i$ (see Section 7.5), then 

$$V(\hat{p}_i) \approx d_i\,\frac{p_i(1 - p_i)}{n} = \frac{b_i}{N}\,p_i(1 - p_i), \qquad (9.12)$$

where $b_i = d_i \times (N/n)$. Similarly, substituting $p_i = t_i/N$, 

$$V(\hat{t}_i) \approx d_i N^2\,\frac{p_i(1 - p_i)}{n} = a_i t_i^2 + b_i t_i,$$

where $a_i = -d_i/n$. If estimating a proportion in a domain, say the proportion of persons 
in the 20-24 age group who were robbery victims, the denominator in (9.12) is 
changed to the estimated population size of the domain (see Section 4.2). 
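
The link between the design effect and the two GVF coefficients can be checked numerically. The sketch below is illustrative only: the helper name `gvf_coefficients` and the values chosen for the deff, n, N, and t are hypothetical and are not NCVS figures.

```python
def gvf_coefficients(deff, n, N):
    """Return (a_i, b_i) with a_i = -deff/n and b_i = deff * (N/n)."""
    return -deff / n, deff * N / n

# Hypothetical inputs, chosen only to illustrate the identity.
deff, n, N, t = 2.0, 50_000, 100_000_000, 800_000
a_i, b_i = gvf_coefficients(deff, n, N)

p = t / N
direct = deff * N**2 * p * (1 - p) / n   # d_i N^2 p_i (1 - p_i) / n
via_gvf = a_i * t**2 + b_i * t           # a_i t_i^2 + b_i t_i
print(direct, via_gvf)
```

With these made-up inputs the sampling weight N/n is 2000, so b_i is the weight times the deff.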

If the deffs are similar for different estimates so that $a_i \approx a$ and $b_i \approx b$, then 
constants a and b can be estimated that give (9.10) and (9.11) as approximations to 
the variance for a number of quantities. The general procedure for constructing a 
generalized variance function is as follows: 


1 Using replication or some other method, estimate variances for k population totals 
of special interest, $\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_k$. Let $v_i$ be the relative variance for $\hat{t}_i$, 
$v_i = \hat{V}(\hat{t}_i)/\hat{t}_i^2$, for i = 1, 2, ..., k. 

2 Postulate a model relating $v_i$ to $\hat{t}_i$. The 1990 NCVS and many other surveys use 
the model 

$$v_i = \alpha + \frac{\beta}{\hat{t}_i}. \qquad (9.13)$$

This is a linear regression model with response variable $v_i$ and explanatory variable 
$1/\hat{t}_i$. Valliant (1987) found that this model produces consistent estimators of the 
variances for the class of superpopulation models he studied. 

3 Use regression techniques to estimate $\alpha$ and $\beta$ by a and b. Valliant (1987) suggests 
using weighted least squares to estimate the parameters, giving higher weight to 
items with small $v_i$. 

4 Use the estimated regression equation to predict the relative variance of an estimated 
total $\hat{t}_{new}$: $\hat{v}_{new} = a + b/\hat{t}_{new}$. Since $\hat{v}_{new}$ is the predicted value of the relative 
variance $V(\hat{t}_{new})/\hat{t}_{new}^2$, the GVF estimate of $V(\hat{t}_{new})$ is $\hat{V}(\hat{t}_{new}) = a\hat{t}_{new}^2 + b\hat{t}_{new}$. 
The GVF model can also be used to estimate the variance of a percentage, with 
$\hat{V}(\hat{p}) = \hat{p}(1 - \hat{p})\,b/\hat{z}$, where $\hat{z}$ is the estimated number of units in the base of the 
percentage (see Exercise 25). 
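
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the procedure any agency actually uses: the function names are invented, the weights 1/v_i^2 are one assumed way of giving higher weight to items with small v_i, and the input totals are made up so that their variances follow model (9.13) exactly.

```python
def fit_gvf(totals, variances, weights=None):
    """Fit v_i = alpha + beta / t_i by weighted least squares; return (a, b)."""
    x = [1.0 / t for t in totals]                           # explanatory variable 1/t_i
    v = [var / t**2 for var, t in zip(variances, totals)]   # relative variances v_i
    if weights is None:
        # Assumed weighting: items with small relative variance get more weight.
        weights = [1.0 / vi**2 for vi in v]
    sw = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
    v_bar = sum(w * vi for w, vi in zip(weights, v)) / sw
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxv = sum(w * (xi - x_bar) * (vi - v_bar) for w, xi, vi in zip(weights, x, v))
    b = sxv / sxx
    a = v_bar - b * x_bar
    return a, b

def gvf_variance(t_new, a, b):
    """Step 4: GVF estimate of the variance of a new estimated total."""
    return a * t_new**2 + b * t_new

# Made-up totals whose variances follow the model exactly (steps 1 and 2).
true_a, true_b = -0.00001833, 3725.0
totals = [50_000, 200_000, 800_000, 3_000_000]
variances = [true_a * t**2 + true_b * t for t in totals]

a_hat, b_hat = fit_gvf(totals, variances)            # step 3
se_new = gvf_variance(800_510, a_hat, b_hat) ** 0.5  # step 4
print(a_hat, b_hat, round(se_new))
```

With real data the directly estimated variances would scatter around the fitted line, and the choice of weights would affect the estimates of a and b.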


The $a_i$ and $b_i$ for individual items are replaced by quantities a and b which are 
calculated from all k items. For the 1990 NCVS, b = 3725. Most weights in the 1990 
NCVS are between 1500 and 2500; b approximately equals the (average weight) × 
(deff), if the overall deff is about two. 

The model used in (9.13) is relatively simple; if the deffs vary greatly among 
different responses, the simple model may give inaccurate estimates of variances 
for some responses. Krenzke (1995) describes alternative models considered for the 
NCVS, and provides a good example of the process used to develop a GVF model 
that takes account of nonconstant deffs. 

Valliant (1987) found that if design effects for the k estimated totals are similar, 
the GVF variances are often more stable than the direct estimates of variance, as they 
smooth out some of the fluctuations from item to item. If a quantity of interest does 
not follow the model in Step 2, however, the GVF estimate of the variance is likely 
to be poor, and you can only know that it is poor by calculating the variance directly. 


Advantages: The GVF may be used when insufficient information is provided in the 
public-use data files to allow direct calculation of standard errors. The data collector 
can calculate the GVF, and often has more information for estimating variances than 
is released to the public. A GVF saves a great deal of time and speeds production of 
annual reports. It is also useful for designing similar surveys in the future. 


Disadvantages: The model relating $v_i$ to $\hat{t}_i$ may not be appropriate for the quantity 
you are interested in, resulting in an unreliable estimate of the variance. You must be 
careful about using GVFs for estimates not included when calculating the regression 
parameters. If a subpopulation has an unusually high degree of clustering (and hence 
a high deff), the GVF estimate of the variance may be much too small. 


Confidence Intervals 


9.5.1 Confidence Intervals for Smooth Functions of Population Totals 

Theoretical results exist for most of the variance estimation methods discussed in 
this chapter, stating that under certain assumptions $(\hat{\theta} - \theta)/\sqrt{\hat{V}(\hat{\theta})}$ asymptotically 
follows a standard normal distribution. These results and conditions are given in 
Binder (1983), for linearization estimates; in Krewski and Rao (1981) and Rao and 
Wu (1985), for jackknife and BRR; in Rao and Wu (1988) and Sitter (1992), for 
bootstrap. Consequently, when the assumptions are met, an approximate 95% CI for 
$\theta$ may be constructed as 

$$\hat{\theta} \pm 1.96\sqrt{\hat{V}(\hat{\theta})}.$$

Alternatively, a $t_{df}$ percentile may be substituted for 1.96, with df = (number of 
groups − 1) for the random group method, and df = (number of psus − number of 
strata) for the other methods. Rust and Rao (1996) give guidelines for appropriate 
dfs. The bootstrap method may also be used to calculate CIs directly. 

Roughly speaking, the assumptions for linearization, jackknife, BRR, and bootstrap 
are as follows: 

1 The quantity of interest $\theta$ can be expressed as a smooth function of the population 
totals; more precisely, $\theta = h(t_1, t_2, \ldots, t_k)$, where the second-order partial 
derivatives of h are continuous. 

2 The sample sizes are large: Either the number of psus sampled in each stratum 
is large, or the survey contains a large number of strata. (See Rao and Wu, 1985, 
for the precise technical conditions needed.) Also, to construct a CI using the 
normal distribution, the sample sizes must be large enough so that the sampling 
distribution of $\hat{\theta}$ is approximately normal. 


Furthermore, a number of simulation studies indicate that these CIs behave well in 
practice. Wolter (1985) summarizes some of the simulation studies; others are found 
in Kovar et al. (1988) and Rao et al. (1992). These studies indicate that the jackknife 
and linearization tend to give similar estimates of the variance, while the bootstrap 
and BRR give slightly larger estimates. Sometimes a transformation may be used 
so that the sampling distribution of a statistic is closer to a normal distribution: if 
estimating total income, for example, a log transformation may be used because the 
distribution of income is extremely skewed. 


9.5.2 Confidence Intervals for Population Quantiles 


The theoretical results described above for BRR, jackknife, bootstrap, and lineariza- 
tion methods do not apply to population quantiles, however, because they are not 
smooth functions of population totals. Special methods have been developed to con- 
struct CIs for quantiles; McCarthy (1993) compares several CIs for the median, and 
his discussion applies to other quantiles as well. 

Let q be between 0 and 1. Then define the quantile $\theta_q$ as $\theta_q = F^{-1}(q)$, where $F^{-1}(q)$ 
is defined to be the smallest value y satisfying $F(y) \ge q$. Similarly, define $\hat{\theta}_q = \hat{F}^{-1}(q)$. 
(Alternatively, as in Example 7.6, interpolation can be used to define quantiles.) Now 
$F^{-1}$ and $\hat{F}^{-1}$ are not smooth functions, but we assume the population and sample are 
large enough that they can be well approximated by continuous functions. 

Some of the methods already discussed work quite well for constructing CIs for 
quantiles. The random group method works well if the number of random groups, R, is 
moderate. Let $\hat{\theta}_q(r)$ be the estimated quantile from random group r. Then, a CI for $\theta_q$ is 

$$\hat{\theta}_q \pm t\sqrt{\frac{1}{R(R-1)}\sum_{r=1}^{R}\left[\hat{\theta}_q(r) - \hat{\theta}_q\right]^2},$$

where t is the appropriate percentile from a t distribution with R − 1 degrees of freedom. 
Similarly, studies by McCarthy (1993), Kovar et al. (1988), Sitter (1992), Rao 
et al. (1992), and Shao and Chen (1998) indicate that in certain designs CIs can be 
formed using 

$$\hat{\theta}_q \pm 1.96\sqrt{\hat{V}(\hat{\theta}_q)},$$

where the variance estimate is calculated using BRR or bootstrap. 

FIGURE 9.2 

Woodruff’s confidence interval for the quantile $\theta_q$ if the empirical distribution function is 
continuous. Since $F(y)$ is a proportion, we can easily calculate a confidence interval for any 
value of y, shown on the vertical axis. We then look at the corresponding points on the horizontal 
axis to form a confidence interval for $\theta_q$. 

[Figure: the smoothed $\hat{F}(y)$ curve. The vertical axis marks $q$ and $q \pm 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}$; the horizontal axis marks $\hat{\theta}_q$ and the corresponding lower and upper confidence limits for $\theta_q$.] 


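An alternative interval can be constructed based on a method introduced by 
Woodruff (1952). For any y, $\hat{F}(y)$ is a function of population totals: 
$\hat{F}(y) = \sum_{i \in S} w_i u_i / \sum_{i \in S} w_i$, where $u_i = 1$ if $y_i \le y$ and $u_i = 0$ if $y_i > y$. Thus, a method in this chapter 
can be used to estimate $V[\hat{F}(y)]$ for any value y, and an approximate 95% CI for $F(y)$ 
is given by 

$$\hat{F}(y) \pm 1.96\sqrt{\hat{V}[\hat{F}(y)]}.$$

Now let’s use the CI for $q = F(\theta_q)$ to obtain an approximate CI for $\theta_q$. Since we have 
a 95% CI, 

$$
\begin{aligned}
0.95 &\approx P\left\{\hat{F}(\theta_q) - 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]} \le q \le \hat{F}(\theta_q) + 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\} \\
&= P\left\{q - 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]} \le \hat{F}(\theta_q) \le q + 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\} \\
&= P\left(\hat{F}^{-1}\left\{q - 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\} \le \theta_q \le \hat{F}^{-1}\left\{q + 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\}\right).
\end{aligned}
$$

So an approximate 95% CI for the quantile $\theta_q$ is 

$$\left[\hat{F}^{-1}\left\{q - 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\},\ \hat{F}^{-1}\left\{q + 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\}\right].$$

The derivation of this CI is illustrated in Figure 9.2. An appropriate t critical value 
may be substituted for 1.96 if desired. 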

We need several technical assumptions to use the Woodruff-method interval. These 
assumptions are stated by Rao and Wu (1987) and Francisco and Fuller (1991), who 
studied a similar CI. Essentially, the problem is that both $F$ and $\hat{F}$ are step functions; 
they have jumps at the values of y in the population and sample. The technical conditions 
basically say that the jumps in $F$ and in $\hat{F}$ should be small, and that the sampling 
distribution of $\hat{F}(y)$ is approximately normal. Sitter and Wu (2001) show that the 
Woodruff method gives CIs with approximately correct coverage probabilities even 
when q is large or small. 


EXAMPLE 9.12 Let’s use Woodruff’s method to construct a 95% CI for the median height in the file 
htstrat.dat, discussed in Example 7.3. The following values were obtained for the 
empirical distribution function: 


$y$             165     166     167     168     169     170     171 
$\hat{F}(y)$    0.3781  0.4438  0.4844  0.5125  0.5375  0.5656  0.6000 


In Example 7.6, we estimated the population median by 

$$\hat{\theta}_{0.5} = 167 + \frac{0.5 - 0.4844}{0.5125 - 0.4844}\,(168 - 167) = 167.6.$$

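Note that 

$$\hat{F}(\hat{\theta}_{0.5}) = \frac{\sum_{h=1}^{2}\sum_{i \in S_h} w_{hi}u_{hi}}{\sum_{h=1}^{2}\sum_{i \in S_h} w_{hi}} = \frac{\sum_{h=1}^{2}\sum_{i \in S_h} w_{hi}u_{hi}}{2000},$$

where $u_{hi} = 1$ if $y_{hi} \le \hat{\theta}_{0.5}$ and 0 otherwise, so, using the variance for the combined 
ratio estimator in Section 4.5, 

$$\hat{V}[\hat{F}(\hat{\theta}_{0.5})] = \frac{1}{(2000)^2}\left[(1000)^2\left(1 - \frac{160}{1000}\right)\frac{s_{e1}^2}{160} + (1000)^2\left(1 - \frac{40}{1000}\right)\frac{s_{e2}^2}{40}\right],$$

where $s_{eh}^2$ is the sample variance of the values $e_{hi} = u_{hi} - 0.5$ for stratum h. Using the 
values $s_{e1}^2 = 0.1789$ and $s_{e2}^2 = 0.1641$ results in $\hat{V}[\hat{F}(\hat{\theta}_{0.5})] = 0.00121941$. Thus, for 
this sample, $1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_{0.5})]} = 0.0684$. 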


The lower confidence bound for the median is then $\hat{F}^{-1}(0.5 - 0.0684)$, and the 
upper confidence bound for the median is $\hat{F}^{-1}(0.5 + 0.0684)$. We again use linear 
interpolation to obtain 

$$\hat{F}^{-1}(0.4316) = 165 + \frac{0.4316 - 0.3781}{0.4438 - 0.3781}\,(166 - 165) = 165.8$$

and 

$$\hat{F}^{-1}(0.5684) = 170 + \frac{0.5684 - 0.5656}{0.6 - 0.5656}\,(171 - 170) = 170.1.$$

Thus, an approximate 95% CI for the median is [165.8, 170.1]. 
Some books and software obtain a slightly different CI for the median; these are 
asymptotically equivalent to the CI derived in this section if the underlying population 
distribution function is sufficiently smooth. SAS software calculates the CI 

$$\left[\hat{F}^{-1}\left\{\hat{F}(\hat{\theta}_q) - t_{df}\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\},\ \hat{F}^{-1}\left\{\hat{F}(\hat{\theta}_q) + t_{df}\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\}\right] = [165.6, 169.6].$$

SAS code for producing this CI, with output in Example 7.6, is given on the 
website. ■ 
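
The Woodruff interval with the 1.96 critical value in Example 9.12 can also be traced step by step in Python. The sketch below is an illustration only, not the SAS code mentioned above: it uses just the tabulated values of the empirical distribution function and the two stratum variances quoted in the example, and the helper name `inverse_cdf` is invented here.

```python
import math

# Tabulated empirical distribution function from Example 9.12.
heights = [165, 166, 167, 168, 169, 170, 171]
cdf = [0.3781, 0.4438, 0.4844, 0.5125, 0.5375, 0.5656, 0.6000]

def inverse_cdf(q):
    """Linearly interpolate the inverse of the tabulated distribution function."""
    for k in range(len(cdf) - 1):
        if cdf[k] <= q <= cdf[k + 1]:
            frac = (q - cdf[k]) / (cdf[k + 1] - cdf[k])
            return heights[k] + frac * (heights[k + 1] - heights[k])
    raise ValueError("q is outside the tabulated range")

# (N_h, n_h, s2_eh) for the two strata, as given in the example.
strata = [(1000, 160, 0.1789), (1000, 40, 0.1641)]
N = sum(N_h for N_h, _, _ in strata)

# Stratified variance of the estimated distribution function at the median.
var_F = sum(N_h**2 * (1 - n_h / N_h) * s2 / n_h for N_h, n_h, s2 in strata) / N**2
half_width = 1.96 * math.sqrt(var_F)

median = inverse_cdf(0.5)
lower = inverse_cdf(0.5 - half_width)
upper = inverse_cdf(0.5 + half_width)
print(round(median, 1), round(lower, 1), round(upper, 1))
```

The interval is not symmetric about the estimated median, because the limits are read off the horizontal axis of the distribution function rather than computed as the estimate plus or minus a multiple of a standard error.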


Chapter Summary 


This chapter has briefly introduced you to some basic types of variance estimation 
methods that are used in practice: linearization, random groups, replication, and gen- 
eralized variance functions. But this is just an introduction; you are encouraged to 
read some of the references mentioned in this chapter before applying these meth- 
ods to your own complex survey. Much of the research done exploring properties 
and behavior of these methods has been done since 1980, and variance estimation 
methods are still a subject of research by statisticians. 

Linearization methods are perhaps the most thoroughly researched in terms of 
theoretical properties, and have been widely used to find variance estimates in complex 
surveys. The main drawback of linearization, though, is that the derivatives need 
to be calculated for each statistic of interest, and this complicates the programs for 
estimating variances. If the statistic you are interested in is not handled in the software, 
you must write your own code. 

The random group method is an intuitively appealing method for estimating vari- 
ances. It is easy to explain and to compute, and can be used for almost any statistic of 
interest. Its main drawback is that we generally need enough random groups to have 
a stable estimate of the variance, and the number of random groups we can form is 
limited by the number of psus sampled in a stratum. 

Resampling methods for stratified multistage surveys avoid partial derivatives 
by computing estimates for subsamples of the complete sample. They must be con- 
structed carefully, however, so that the correlation of observations in the same cluster 
is preserved in the resampling. Resampling methods require more computing time 
than linearization but less programming time: the same method is used on all statis- 
tics. They have been shown to be equivalent to linearization for large samples when 
the characteristic of interest is a smooth function of population totals. Resampling 
methods can sometimes capture the variability in weight adjustments used for non- 
response. 

The BRR method can also be used with almost any statistic, but is usually used 
only for two-psus-per-stratum designs, or for designs that can be reformulated into 
two psus per stratum. The jackknife and bootstrap can also be used for most estimators 
likely to be used in surveys (exception: the delete-one jackknife does not work well 
for estimating the variance of quantiles), and may be used in stratified multistage 
samples in which more than two psus are selected in each stratum, but require more 
computing than BRR. 

Generalized variance functions fit a model predicting the variance of a quantity 
from other characteristics. They are easy to use, but may give incorrect inferences for 
a statistic that does not follow the model used to develop the GVF. 

Brogan (2005) reviews software packages that analyze complex survey data. 
SUDAAN (www.rti.org/sudaan), Stata (www.stata.com), SPSS Complex Samples 
(www.spss.com), and SAS (SAS Institute Inc., 2008) software use linearization methods 
to estimate variances of nonlinear statistics. The survey software packages WesVar 
(www.westat.com) and VPLX (Fay, 1990) both use resampling methods to calculate 
variance estimates. Recent versions of SAS and SUDAAN software also implement 
BRR and jackknife. Several software packages in the R language (R Development 
Core Team, 2008) are freely available at www.r-project.org. Lumley (2000) provides 
a package of R survey functions, using linearization and replication methods; Matei 
and Tillé (2005) give R functions for selecting samples and computing the 
Horvitz-Thompson estimator. The free software IVEware (www.isr.umich.edu/src/smp/ive/) 
uses linearization and replication methods along with multiple imputation for missing 
data. Software for analyzing survey data changes rapidly; the Survey Research Methods 
Section of the American Statistical Association (www.amstat.org/sections/SRMS) 
is a good resource for updated information; click on Links and Resources. 


Key Terms 


Balanced repeated replication: Resampling method for variance estimation used 
when there are two psus sampled per stratum. 


Bootstrap: Resampling method for variance estimation in which samples of psus 
with replacement are taken within each stratum. 


Generalized variance function: A formula for variance estimation constructed based 
on a regression model for the variances. 


Jackknife: Resampling method for variance estimation in which each psu is deleted 
in turn. 


Linearization: A method for estimating the variance of a nonlinear function of esti- 
mated population totals by using a Taylor series expansion. 


For Further Reading 


The methods discussed in this chapter are described in more detail by Rao (1988), 
Rao (1997), Rust and Rao (1996), Shao (2003), Wolter (2007), and Brogan (2005). 

Binder (1983, 1996) presents a general theory for using the linearization method 
of estimating the variance, even when the quantities of interest are defined implicitly. 
Demnati and Rao (2004) derive linearization variance estimators using the weights. 

Rao and Wu (1985) give theory (and references to earlier work) showing the 
asymptotic equivalence of different variance estimators. Canty and Davison (1999) 
review resampling methods for variance estimation and illustrate how they can account 
for nonresponse adjustments. Chapter 6 of Shao and Tu (1995) presents theory for 
the jackknife and bootstrap used in complex surveys. 

The CIs presented in this chapter are developed under the design-based approach. 
A 95% CI may be interpreted in the repeated sampling sense that if samples were 
repeatedly taken from the finite population, we would expect 95% of the resulting 
CIs to include the true value of the quantity in the population. In some situations, you 
may want to consider constructing a conditional CI instead. In poststratification, for 
example, the sample sizes in each poststratum are random variables; a conditional 
CI estimates the variance conditionally on the poststrata sample sizes [see (4.22)]. 
Särndal et al. (1992, section 7.10) and Casady and Valliant (1993) discuss conditional 
CIs and give a bibliography of other work. 


A. Introductory Exercises 


Which of the variance estimation methods in this chapter would be suitable for esti- 
mating the proportion of beds that have bednets for the Gambia bednet survey in 
Example 7.1? Explain why each method is or is not appropriate. 


Use the jackknife to estimate $V(\bar{y})$ for the data in srs30.dat, and verify that $\hat{V}_{JK}(\bar{y}) = 
s^2/30$ for these data. What are the jackknife weights for jackknife replicate j? 


Use Woodruff’s method to construct a 95% CI for the median of the data in file 
srs30.dat. 


Estimate the 25th percentile, median, and 75th percentile for the variable acres92 in 
file agstrat.dat, used in Example 3.2. Give a 95% CI for each parameter. 


B. Working with Survey Data 

Use the random groups in the data file syc.dat to estimate the variances for the estimates 
of the proportion of youth who: 

a Are age 14 or younger 

b Are held for a violent offense 

c Lived with both parents when growing up 

d Are male 

e Are Hispanic 

f Grew up primarily in a single-parent family 

g Have used illegal drugs. 

Calculate the jackknife estimate of the variance for the regression estimate of the 
population mean age of trees in a stand for the data in Exercise 3 of Chapter 4. How 


does the jackknife variance compare with the variance calculated using linearization 
methods? 


Use the jackknife to estimate the variances of your estimates in parts (b) and (c) of 
Exercise 16 of Chapter 5. 

Use the jackknife to estimate the variance of the ratio estimator used in Example 4.2. 
How does it compare with the linearization estimator? 


Use Woodruff’s method to construct a 95% CI for the median weekday greens fee for 
nine holes, using the SRS in file golfsrs.dat. 


Use the data in nhanes.dat along with the BRR method to estimate the variance of the 
ratio of triceps skinfold to body mass index (see Exercise 15 of Chapter 7). 


Use the data in nhanes.dat along with the BRR method to estimate the variance of the 
estimated median value of waist circumference (see Exercise 16 of Chapter 7). 


Use the data in file ncvs2000.dat for this exercise. The public use NCVS files list a 
total of 184 pseudo-strata, each with two pseudo-psus (as with the NHANES data in 
Example 9.5, the original stratification and clustering information is altered to preserve 
confidentiality). Construct replicate weights variables from the pseudo-stratum and 
pseudo-psu information, using the BRR method. Then use the replicate weights to 
estimate the percentage of persons who are victims of a violent crime, along with its 
standard error. Compare your results to those of Exercise 17 in Chapter 7. 


C. Working with Theory 
All of the problems in this section require probability and calculus. 


As in Example 9.1, let $h(p) = p(1 - p)$. 


a Find the remainder term in the Taylor expansion, $\int_a^x (x - t)h''(t)\,dt$, and use it to 
find an exact expression for $h(\hat{p})$. 


b Is the remainder term likely to be smaller than the other terms? Explain. 


c Find an exact expression for $V[h(\hat{p})]$ for a simple random sample with replacement. 
How does it compare with the approximation in Example 9.1? Hint: Use 
moments of the Binomial distribution to find $E(\hat{p}^4)$. 


The straight-line regression slope for the population is 

$$B_1 = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{\sum_{i=1}^{N}(x_i - \bar{x}_U)^2}.$$

a pues Bi asa AuHeHon of por ulenon totals ft; = eee Xi, = se xB = 
pee Vi, 14 = » x?, and th = bare 1=N, so that By = h(t), to, ts, ta, ts). 


i=1 i? 


linearization method to find an approximation to the variance of By. Express your 
answer in terms of V(t;) and Cov ({;, ;). 

What is the linearization approximation to the variance for an SRS of size n? 
Find a linearized variate $q_i$ so that $\hat{V}(\hat{B}_1) = \hat{V}(\hat{t}_q)$. 

The variance of a population is 

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2.$$


a Express S? as a function h of population totals r; = ae we t= ye yj, and 
N 
B= Dini (D. 
b Find an estimator $2 by substituting estimators for t), f2, and f3. 


c Find the linearization variance estimator of S”. 


The correlation coefficient for the population is 

$$R = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x}_U)^2\,\sum_{i=1}^{N}(y_i - \bar{y}_U)^2}}.$$

a Ep press R as a function of population totals 4 = ae Xj, 2 = ae yi, B= 
ee I ae 54 = ae Xii, and ts = ee y?, so that R = A(t, to, fs, ta, ts). 

b Letr=A(t,,...,f5), and suppose that E[?;]=1;fori=1,...,5. Use the lineariza- 
tion method to find an approximation to the variance of r. Express your answer 
in terms of V(#;) and Cov (f;, ij). 


R= 


c What is the linearization approximation to the variance for an SRS of size n? 


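Variance estimation with poststratification. Suppose we poststratify the sample into 
L poststrata, with population counts $N_1, N_2, \ldots, N_L$. Then the poststratified estimator 
for the population total is 

$$\hat{t}_{post} = \sum_{l=1}^{L}\frac{N_l}{\hat{N}_l}\,\hat{t}_l = h(\hat{t}_1, \ldots, \hat{t}_L, \hat{N}_1, \ldots, \hat{N}_L),$$

where 

$$\hat{t}_l = \sum_{i \in S} w_i x_{li} y_i, \qquad \hat{N}_l = \sum_{i \in S} w_i x_{li},$$

and $x_{li} = 1$ if unit i is in poststratum l and 0 otherwise. Show, using linearization, that 

$$V(\hat{t}_{post}) \approx V\left[\sum_{l=1}^{L}\left(\hat{t}_l - \frac{t_l}{N_l}\,\hat{N}_l\right)\right].$$

We can thus let $q_i = \sum_{l=1}^{L} x_{li}\left(y_i - \hat{t}_l/\hat{N}_l\right)$ and estimate $V(\hat{t}_{post})$ by 
$\hat{V}(\hat{t}_{post}) = \hat{V}\left(\sum_{i \in S} w_i q_i\right)$. 
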
Consider the random group estimator of the variance from Section 9.2.2. The param- 
eter of interest is 0 = y,,. A simple random sample with replacement of size n is taken 
from the population. The sample is divided into R random groups, each of size m. Let 
6, be the sample mean of the m observations in random group r, let 6= y= ee 1 6, /R, 
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and let V3(6) be the variance estimator defined in (9.5). Show that 


vivo fast wd 
2 ~|m m R-1 JR 


where k= )-*_, (i — ¥y)4/[(N — S41. 
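Suppose a stratified random sample is taken with two observations per stratum. Show 
that if $\sum_{r=1}^{R}\alpha_{rh}\alpha_{rl} = 0$ for $l \ne h$, then 

$$\hat{V}_{BRR}(\bar{y}_{str}) = \hat{V}_{str}(\bar{y}_{str}).$$

HINT: First note that 

$$\bar{y}_{str}(\boldsymbol{\alpha}_r) = \bar{y}_{str} + \sum_{h=1}^{H}\alpha_{rh}\,\frac{N_h}{N}\,\frac{y_{h1} - y_{h2}}{2}.$$

Then express $\hat{V}_{BRR}(\bar{y}_{str})$ directly using $y_{h1}$ and $y_{h2}$. 

Other BRR estimators of the variance are 

$$\frac{1}{4R}\sum_{r=1}^{R}\left[\hat{\theta}(\boldsymbol{\alpha}_r) - \hat{\theta}(-\boldsymbol{\alpha}_r)\right]^2$$

and 

$$\frac{1}{2R}\sum_{r=1}^{R}\left\{\left[\hat{\theta}(\boldsymbol{\alpha}_r) - \hat{\theta}\right]^2 + \left[\hat{\theta}(-\boldsymbol{\alpha}_r) - \hat{\theta}\right]^2\right\}.$$

For a stratified random sample with two observations per stratum, show that if 
$\sum_{r=1}^{R}\alpha_{rh}\alpha_{rl} = 0$ for $l \ne h$, then each of these variance estimators is equivalent 
to $\hat{V}_{str}(\bar{y}_{str})$. 

Suppose the parameter of interest is $\theta = h(t)$, where $h(t) = at^2 + bt + c$ and t is the 
population total. Let $\hat{\theta} = h(\hat{t})$. Show, in a stratified random sample with two observations 
per stratum, that if $\sum_{r=1}^{R}\alpha_{rh}\alpha_{rl} = 0$ for $l \ne h$, then 

$$\frac{1}{4R}\sum_{r=1}^{R}\left[\hat{\theta}(\boldsymbol{\alpha}_r) - \hat{\theta}(-\boldsymbol{\alpha}_r)\right]^2 = \hat{V}_L(\hat{\theta}),$$

the linearization estimator of the variance (see Rao and Wu, 1985). 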


The linearization method in Section 9.1 is the one historically used to find variances. 
Binder (1996) proposes proceeding directly to the estimate of the variance by eval- 
uating the partial derivatives at the sample estimates rather than at the population 
quantities. What is Binder’s estimate for the variance of the ratio estimator? Does it 
differ from that in Section 9.1? 


An alternative approach to linearization variance estimators. Demnati and Rao (2004) 
derive a unified theory for linearization variance estimation using weights. Let $\theta$ be 
the population quantity of interest, and define the estimator $\hat{\theta}$ to be a function of the 
vector of sampling weights and the population values: 

$$\hat{\theta} = g(\mathbf{w}, \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k),$$

where $\mathbf{w} = (w_1, \ldots, w_N)^T$ with $w_i$ the sampling weight of unit i ($w_i = 0$ if i is not in 
the sample), and $\mathbf{y}_j$ is the vector of population values for the jth response variable. 
Then a linearization variance estimator can be found by taking the partial derivatives 
of the function with respect to the weights. Let 

$$z_i = \frac{\partial g(\mathbf{w}, \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k)}{\partial w_i}$$

evaluated at the sampling weights $w_i$. Then we can estimate $V(\hat{\theta})$ by 

$$\hat{V}(\hat{\theta}) = \hat{V}(\hat{t}_z) = \hat{V}\left(\sum_{i \in S} w_i z_i\right).$$

For example, considering the ratio estimator of the population total, 

$$\hat{\theta} = g(\mathbf{w}, \mathbf{x}, \mathbf{y}) = \frac{\sum_{k \in S} w_k y_k}{\sum_{k \in S} w_k x_k}\,t_x.$$

The partial derivative of $\hat{\theta} = g(\mathbf{w}, \mathbf{x}, \mathbf{y})$ with respect to $w_i$ is 

$$\frac{\partial g(\mathbf{w}, \mathbf{x}, \mathbf{y})}{\partial w_i} = \frac{y_i}{\sum_{k \in S} w_k x_k}\,t_x - \frac{x_i\sum_{k \in S} w_k y_k}{\left(\sum_{k \in S} w_k x_k\right)^2}\,t_x = (y_i - \hat{B}x_i)\,\frac{t_x}{\hat{t}_x},$$

where $\hat{B} = \hat{t}_y/\hat{t}_x$. For an SRS, finding the estimated variance of $\hat{t}_z$ gives (4.11). 

Consider the poststratified estimator in Exercise 17. 


a Write the estimator as $\hat{t}_{post} = g(\mathbf{w}, \mathbf{y}, \mathbf{x}_1, \ldots, \mathbf{x}_L)$, where $x_{li} = 1$ if observation i is 
in poststratum l and 0 otherwise. 


b Find an estimator of $V(\hat{t}_{post})$ using the Demnati-Rao (2004) approach. 

Consider the model sometimes adopted for GVFs in (9.13). Consider the one-stage 
cluster design studied in Section 5.2.2, in which each psu has size M and an SRS of 
n psus is selected from the N psus in the population. Assume that N is large and n/N 
is small. 


a For a binary response with $p = \bar{y}_U$, show that 

$$V(\hat{p}) \approx \frac{p(1 - p)}{nM}\,[1 + (M - 1)\,\mathrm{ICC}].$$


b Show that the relative variance $v = V(\hat{t})/t^2$ can be written as $v \approx \alpha + \beta/t$, and 
give $\alpha$ and $\beta$. Consequently, if the intraclass correlation coefficient is similar for 
responses in the survey, the GVF method should work well. 

Let b be an estimator for $\beta$ in the model for GVFs in (9.13). Let $B = t_1/t_2$ and $\hat{B} = \hat{t}_1/\hat{t}_2$. 
Suppose that $\hat{B}$ and $\hat{t}_2$ are independent. 

a Using the model in (9.13) and the result in (9.2), show that we can estimate 
$V(\hat{B})$ by 

$$\hat{V}(\hat{B}) = b\hat{B}^2\left(\frac{1}{\hat{t}_1} - \frac{1}{\hat{t}_2}\right).$$

b Now let B be a proportion for a subpopulation, where $t_2$ is the size of the subpopulation 
and $t_1$ is the number of units in that subpopulation having a certain 
characteristic. Show that $\hat{V}(\hat{B}) = b\hat{B}(1 - \hat{B})/\hat{t}_2$, and that $\hat{V}(\hat{B}) = \hat{V}(1 - \hat{B})$. 


D. Projects and Activities 


Forest data. Use the forest data from Exercise 36 of Chapter 2 for this exercise. 
Construct the jackknife weights for your SRS of size 2000, and use the jackknife 
weights to estimate the variance of the ratio of hillshade index at 9 am to hillshade 
index at noon. Compare your answer with the linearization variance estimate you 
calculated in Exercise 41 of Chapter 4. 


Index fund. In Exercise 43 of Chapter 6, you selected a sample of size 30 from the 
S&P 500 companies with probability proportional to market capitalization. Construct 
jackknife weights for this sample. 


Trucks. Use the data from the Vehicle Inventory and Use Survey (VIUS), described 
in Exercise 34 of Chapter 3, for this problem. The survey design is stratified random 
sampling, with a sample size of 136,113 trucks. 


a Which of the variance estimation methods in this chapter can be used to 
estimate the variance of the estimated ratio of miles driven in 2002 (miles_annl) 
to lifetime miles driven (miles_life)? What are the advantages and drawbacks of 
each method? 


b Which of the variance estimation methods can be used to estimate the variance of 
the estimated median number of miles driven in 2002? What are the advantages 
and drawbacks of each method? 


c Use the bootstrap with 500 replications to estimate the variances of the estimates 
in (a) and (b). 


Baseball data. Construct jackknife weights for your dataset from Exercise 35 of 
Chapter 3. Use these weights to estimate the variance of the estimated mean of the 
variable logsal, and of the ratio (total number of home runs)/(number of runs scored). 


IPUMS exercises. Construct the jackknife weights for your dataset from Exercise 38 
of Chapter 5. Use these weights to estimate the variances of the estimated population 
mean and total of inctot. 


Find a survey on the Internet that releases replicate weights for variance estimation. 
What method was used to construct the replicate weights? How are nonresponse 
adjustments incorporated into the replicate weights? 


Activity for course project. Describe the method used for variance estimation for the 
survey you looked at in Exercise 31 of Chapter 7. If the survey releases replicate 
weight variables, what method was used to construct them? Do the replicate weights 
incorporate nonresponse adjustments? If so, how? Now do either (a) or (b) for your 
survey: 


a If the survey releases replicate weight variables, use them to estimate the variance 
of the estimated means you found in Exercise 31 of Chapter 7. If the replicate 
weights are formed by BRR or bootstrap, also estimate the variances of the estimated 
quantiles. 


b If the survey releases stratification and clustering information, use these to construct 
replicate weights using one of the resampling methods described in this 
chapter. Use the replicate weights to estimate the variance of the estimated means 
you found in Exercise 31 of Chapter 7. 


Categorical Data Analysis in 
Complex Surveys 


But Statistics must be made otherwise than to prove a preconceived idea. 


—Florence Nightingale, Annotation in Physique Sociale by A. Quetelet 


Up to now we have mostly been looking at how to estimate summary quantities such 
as means, totals, and percentages in different sampling designs. Totals and percent- 
ages are important for many surveys to provide a description of the population: for 
instance, the percentage of the population victimized by crime or the total number of 
unemployed persons in the United States. Often, though, researchers are interested 
in multivariate questions: Is race associated with criminal victimization, or can we 
predict unemployment status from demographic variables? Such questions are typi- 
cally answered in statistics using techniques in categorical data analysis or regression. 
The techniques you learned in an introductory statistics course, though, assumed that 
observations were all independent and identically distributed from some population 
distribution. These assumptions are no longer met in data from complex surveys; in 
this and the following chapter we examine the effects of the complex sampling design 
on commonly used statistical analyses. 

Since much information from sample surveys is collected in the form of percent- 
ages, categorical data methods are extensively used in the analysis. In fact, many of 
the data sets used to illustrate the chi-square test in introductory statistics textbooks 
originate in complex surveys. Our greatest concern is with the effects of clustering on 
hypothesis tests and models for categorical data, since clustering usually decreases 
precision. We begin by reviewing various chi-square tests when a simple random 
sample (SRS) is taken from a large population. 


10.1 


Chi-Square Tests with Multinomial Sampling 


EXAMPLE 10.1 Each couple in an SRS of 500 married couples from a large population is 
asked whether (1) the household owns at least one personal computer and (2) the 
household subscribes to cable television. The following contingency table presents the 
outcomes: 


                          Computer?
Observed Count          Yes      No     Total
Cable?    Yes           119     188      307
          No             88     105      193
          Total         207     293      500


Are households with a computer more likely to subscribe to cable? A chi-square 
test for independence is often used for such questions. Under the null hypothesis that 
owning a computer and subscribing to cable are independent, the expected counts for 
each cell in the contingency table are the following: 


                          Computer?
Expected Count          Yes      No     Total
Cable?    Yes          127.1   179.9     307
          No            79.9   113.1     193
          Total        207     293       500


Pearson’s chi-square test statistic is 


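\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = 2.281. \]

The likelihood ratio chi-square test statistic is 

\[ G^2 = 2 \sum_{\text{all cells}} (\text{observed count}) \ln\left(\frac{\text{observed count}}{\text{expected count}}\right) = 2.275. \]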


The two test statistics are asymptotically equivalent; for large samples, each approximately 
follows a chi-square ($\chi^2$) distribution with 1 degree of freedom (df) under 
the null hypothesis. The p-value for each statistic is 0.13, giving no reason to doubt 
the null hypothesis that owning a computer and subscribing to cable television are 
independent. 
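
The calculation in this example can be reproduced with a few lines of code. The following is a minimal sketch in Python using only the standard library; the function name `chi_square_2x2` is an illustrative choice, not something from the text, and the p-value formula applies only to the 1-df case of a 2 x 2 table.

```python
import math

def chi_square_2x2(table):
    """Return (expected, X2, G2, p_X2, p_G2) for a 2 x 2 table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Expected counts under independence: (row total)(column total)/n.
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    pairs = [(o, e) for orow, erow in zip(table, expected)
             for o, e in zip(orow, erow)]
    x2 = sum((o - e) ** 2 / e for o, e in pairs)
    g2 = 2 * sum(o * math.log(o / e) for o, e in pairs)  # assumes all counts > 0

    def p_value(stat):
        # Upper tail of a chi-square distribution with 1 df.
        return math.erfc(math.sqrt(stat / 2))

    return expected, x2, g2, p_value(x2), p_value(g2)

# Rows: cable yes/no; columns: computer yes/no (Example 10.1).
observed = [[119, 188], [88, 105]]
expected, x2, g2, p_x2, p_g2 = chi_square_2x2(observed)
print(round(x2, 3), round(g2, 3), round(p_x2, 2), round(p_g2, 2))
```

Both statistics come out at about 2.28, and both p-values round to 0.13, in line with the example.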

If owning a computer and subscribing to cable are independent events, the odds 
that a cable subscriber will own a computer should equal the odds that a non-cable- 
subscriber will own a computer. We estimate the odds of owning a computer if the 
household subscribes to cable as 119/188 and estimate the odds of owning a computer 
if the household does not subscribe to cable as 88/105. The odds ratio is therefore 
estimated as 


\[ \frac{119/188}{88/105} = 0.755. \]

If the null hypothesis of independence is true, we expect the odds ratio to be close to 
one. Equivalently, we expect the logarithm of the odds ratio to be close to zero. The 
log odds ratio is $-0.28$ with asymptotic standard error 

\[ \sqrt{\frac{1}{119} + \frac{1}{88} + \frac{1}{188} + \frac{1}{105}} = 0.186; \]

an approximate 95% confidence interval (CI) for the log odds ratio is 
$-0.28 \pm 1.96(0.186) = [-0.646, 0.084]$. This CI includes 0, and confirms the result of 
the hypothesis test that there is no evidence against independence. ■
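
The odds ratio, its logarithm, and the confidence interval above can be checked with a short sketch such as the following. The function name `log_odds_ratio_ci` and its argument order are assumptions made for illustration.

```python
import math

def log_odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d), its log, asymptotic SE, and a CI for the log odds ratio."""
    odds_ratio = (a / b) / (c / d)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, log_or, se, (log_or - z * se, log_or + z * se)

# Example 10.1: a, b = computer yes/no among cable subscribers;
# c, d = computer yes/no among non-subscribers.
odds_ratio, log_or, se, ci = log_odds_ratio_ci(119, 188, 88, 105)
print(round(odds_ratio, 3), round(log_or, 2), round(se, 3))
```

Because the interval for the log odds ratio contains 0, the corresponding interval for the odds ratio itself (obtained by exponentiating the endpoints) contains 1.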


Chi-square tests are commonly used in three situations: testing independence of 
factors, testing homogeneity of proportions, and testing goodness of fit. Each assumes 
a form of random sampling. These tests are discussed in more detail in Agresti (2002) 
and Simonoff (2006). 


10.1.1 Testing Independence of Factors 


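Each of $n$ independent observations is cross-classified by two factors: row factor $R$ 
with $r$ levels and column factor $C$ with $c$ levels. Each observation has probability $p_{ij}$ 
of falling into row category $i$ and column category $j$, giving the following table of 
true probabilities. Here, $p_{i+} = \sum_{j=1}^{c} p_{ij}$ is the probability that a randomly selected 
unit will fall in row category $i$, and $p_{+j} = \sum_{i=1}^{r} p_{ij}$ is the probability that a randomly 
selected unit will fall in column category $j$. 

\[
\begin{array}{cc|cccc|c}
 & & \multicolumn{4}{c|}{C} & \\
 & & 1 & 2 & \cdots & c & \\ \hline
 & 1 & p_{11} & p_{12} & \cdots & p_{1c} & p_{1+} \\
R & 2 & p_{21} & p_{22} & \cdots & p_{2c} & p_{2+} \\
 & \vdots & \vdots & \vdots & & \vdots & \vdots \\
 & r & p_{r1} & p_{r2} & \cdots & p_{rc} & p_{r+} \\ \hline
 & & p_{+1} & p_{+2} & \cdots & p_{+c} & 1
\end{array}
\]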


The observed count in cell $(i,j)$ from the sample is $x_{ij}$. If all units in the sample 
are independent, the $x_{ij}$'s are from a multinomial distribution with $rc$ categories; this 
sampling scheme is known as multinomial sampling. In surveys, the assumptions 
for multinomial sampling are met in an SRS with replacement; they are approximately 
met in an SRS without replacement when the sample size is small compared 
with the population size. The latter situation occurred in Example 10.1: Independent 
multinomial sampling means we have a sample of 500 (approximately) independent 
households, and we observe to which of the four categories each household belongs. 

The null hypothesis of independence is 


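\[ H_0: p_{ij} = p_{i+} p_{+j} \quad \text{for } i = 1, \ldots, r \text{ and } j = 1, \ldots, c. \tag{10.1} \]

Let $m_{ij} = n p_{ij}$ represent the expected counts. If $H_0$ is true, $m_{ij} = n p_{i+} p_{+j}$, and $m_{ij}$ can 
be estimated by 

\[ \hat{m}_{ij} = n \hat{p}_{i+} \hat{p}_{+j} = n \, \frac{x_{i+}}{n} \, \frac{x_{+j}}{n}, \]

where $\hat{p}_{ij} = x_{ij}/n$, $\hat{p}_{+j} = \sum_{i=1}^{r} \hat{p}_{ij}$, and $\hat{p}_{i+} = \sum_{j=1}^{c} \hat{p}_{ij}$. Pearson's chi-square test 
statistic is 

\[ X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} = n \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j})^2}{\hat{p}_{i+}\hat{p}_{+j}}. \tag{10.2} \]

The likelihood ratio test statistic is 

\[ G^2 = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} \ln\left(\frac{x_{ij}}{\hat{m}_{ij}}\right) = 2n \sum_{i=1}^{r} \sum_{j=1}^{c} \hat{p}_{ij} \ln\left(\frac{\hat{p}_{ij}}{\hat{p}_{i+}\hat{p}_{+j}}\right). \tag{10.3} \]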


If multinomial sampling is used with a sufficiently large sample size, $X^2$ and $G^2$ are 
approximately distributed as a $\chi^2$ random variable with $(r-1)(c-1)$ degrees of 
freedom (df) under the null hypothesis. How large is “sufficiently large” depends on 
the number of cells and expected probabilities; Fienberg (1979) argues that p-values 
will be approximately correct if (a) the expected count in each cell is greater than 1 
and (b) $n > 5 \times$ (number of cells). 
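
The two conditions just described are easy to check before running a test. The sketch below is one possible way to do so; the function names are hypothetical, and the estimated expected counts (row total times column total over $n$) stand in for the true expected counts.

```python
def fienberg_rule_ok(table):
    """Check conditions (a) and (b) above for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Smallest estimated expected count under independence.
    min_expected = min(r * c / n for r in row_tot for c in col_tot)
    n_cells = len(row_tot) * len(col_tot)
    return min_expected > 1 and n > 5 * n_cells

def degrees_of_freedom(table):
    """(r - 1)(c - 1) for an r x c table."""
    return (len(table) - 1) * (len(table[0]) - 1)

ok = fienberg_rule_ok([[119, 188], [88, 105]])   # table from Example 10.1
df = degrees_of_freedom([[119, 188], [88, 105]])
print(ok, df)
```

For the table in Example 10.1 both conditions hold comfortably: the smallest expected count is about 80 and $n = 500$ far exceeds $5 \times 4 = 20$.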

An equivalent statement to (10.1) is that all odds ratios equal 1: 


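\[ H_0: \frac{p_{11} p_{ij}}{p_{1j} p_{i1}} = 1 \quad \text{for all } i \ge 2 \text{ and } j \ge 2. \]

We may estimate any odds ratio $(p_{ij} p_{kl})/(p_{il} p_{kj})$ by substituting in estimated proportions: 
$(\hat{p}_{ij} \hat{p}_{kl})/(\hat{p}_{il} \hat{p}_{kj})$. If the sample is sufficiently large, the logarithm of the estimated 
odds ratio is approximately normally distributed with estimated variance (see Exercise 15) 

\[ \hat{V}\left[\ln\left(\frac{\hat{p}_{ij} \hat{p}_{kl}}{\hat{p}_{il} \hat{p}_{kj}}\right)\right] = \frac{1}{x_{ij}} + \frac{1}{x_{kl}} + \frac{1}{x_{il}} + \frac{1}{x_{kj}}. \]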


10.1.2 Testing Homogeneity of Proportions 


The Pearson and likelihood ratio test statistics in (10.2) and (10.3) may also be used 
when independent random samples from $r$ populations are each classified into $c$ 
categories. Multinomial sampling is done within each population, so the sampling 
scheme is called product-multinomial sampling. Product-multinomial sampling is 
equivalent to stratified random sampling when the sampling fraction for each stratum 
is small or when sampling is with replacement. 

The difference between product-multinomial sampling and multinomial sampling 
is that the row totals $p_{i+}$ and $x_{i+}$ are fixed quantities in product-multinomial 
sampling: $x_{i+}$ is the predetermined sample size for stratum $i$. The null hypothesis 
that the proportion of observations falling in class $j$ is the same for all strata is 

\[ H_0: \frac{p_{1j}}{p_{1+}} = \frac{p_{2j}}{p_{2+}} = \cdots = \frac{p_{rj}}{p_{r+}} = p_{+j} \quad \text{for all } j = 1, \ldots, c. \tag{10.4} \]

If the null hypothesis in (10.4) is true, the expected counts under $H_0$ are again 
$m_{ij} = n p_{i+} p_{+j}$, exactly as in the test for independence. 


EXAMPLE 10.2 The sample sizes used in Exercise 14 of Chapter 3, the stratified sample of nursing 
students and tutors, were the sample sizes for the respondents. Let’s use a chi-square 
test for homogeneity of proportions to test the null hypothesis that the nonresponse 
rate is the same for each stratum. The four strata form the rows in the following 
contingency table. 


                       Nonrespondent   Respondent   Total
General student              46            222        268
General tutor                41            109        150
Psychiatric student          17             40         57
Psychiatric tutor             8             26         34
Total                       112            397        509
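
As a sketch of how the homogeneity test for the table above might be computed, the following Python code evaluates (10.2) and (10.3) for a general $r \times c$ table of counts. The helper names `chi2_sf` and `pearson_and_lr` are illustrative assumptions; the tail-probability helper handles integer df only and uses nothing beyond the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    # Start from the closed forms for 1 or 2 df, then step df up by 2 using
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with a = df/2, z = x/2.
    z = x / 2
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(z)) if df % 2 else math.exp(-z)
    while a < df / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q

def pearson_and_lr(table):
    """Return (X2, G2, df) for an r x c table of counts."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:  # a zero count contributes nothing to G2
                g2 += 2 * obs * math.log(obs / exp)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return x2, g2, df

# Rows: general student, general tutor, psychiatric student, psychiatric tutor.
# Columns: nonrespondent, respondent.
counts = [[46, 222], [41, 109], [17, 40], [8, 26]]
x2, g2, df = pearson_and_lr(counts)
p_x2, p_g2 = chi2_sf(x2, df), chi2_sf(g2, df)
print(round(x2, 3), round(g2, 3), df, round(p_x2, 3), round(p_g2, 3))
```

With four strata and two response categories the test has $(4-1)(2-1) = 3$ df.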


The two chi-square test statistics are $X^2 = 8.218$, with p-value 0.042, and $G^2 = 8.165$, 
with p-value 0.043. There is thus evidence of different nonresponse rates among the 
four groups. However, the following table shows that the difference is not attributable 
to either the main effect of general/psychiatric or student/tutor: 


Nonresponse rate     Student    Tutor
General                17%       27%
Psychiatric            30%       24%


Further investigation would be needed to explore the nonresponse pattern. 


10.1.3 Testing Goodness of Fit 


In the classical goodness of fit test, multinomial sampling is again assumed, with 
independent observations classified into $k$ categories. The null hypothesis is 

\[ H_0: p_i = p_i^{(0)} \quad \text{for } i = 1, \ldots, k, \]

where $p_i^{(0)}$ is prespecified or is a function of parameters $\theta$ to be estimated from the 
data. 


EXAMPLE 10.3 Webb (1955) examined the safety records for 17,952 Air Force pilots for an 8-year 
period around World War II and constructed the following frequency table. 


Number of Accidents    Number of Pilots
         0                  12,475
         1                   4,117
         2                   1,016
         3                     269
         4                      53
         5                      14
         6                       2
         7                       2


If accidents occur randomly (if no pilots are more or less “accident-prone” than 
others), a Poisson distribution should fit the data well. We estimate the mean of the 
Poisson distribution by the mean number of accidents per pilot in the sample, 0.40597. 
The observed and expected probabilities under the null hypothesis that the data follow 
a Poisson distribution are given in the following table. The expected probabilities are 
computed using the Poisson probabilities $e^{-\lambda}\lambda^x/x!$ with $\hat{\lambda} = 0.40597$. 


Number of     Observed                   Expected Probability
Accidents     Proportion, $\hat{p}_i$    Under $H_0$, $p_i^{(0)}$
    0              0.6949                     0.6663
    1              0.2293                     0.2705
    2              0.0566                     0.0549
    3              0.0150                     0.0074
    4              0.0030                     0.0008
    5+             0.0012                     0.0001


The two chi-square test statistics are 


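\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \sum_{i=1}^{k} \frac{\left(n\hat{p}_i - n p_i^{(0)}\right)^2}{n p_i^{(0)}} = n \sum_{i=1}^{k} \frac{\left(\hat{p}_i - p_i^{(0)}\right)^2}{p_i^{(0)}} \tag{10.5} \]

and 

\[ G^2 = 2n \sum_{i=1}^{k} \hat{p}_i \ln\left(\frac{\hat{p}_i}{p_i^{(0)}}\right). \tag{10.6} \]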


For the pilots, $X^2 = 756$ and $G^2 = 400$. If the null hypothesis is true, both statistics 
follow a $\chi^2$ distribution with 4 df (2 df are spent on $n$ and $\hat{\lambda}$). Both p-values are less 
than 0.0001, providing evidence that a Poisson model does not fit the data. More 
pilots have no accidents, or more than two accidents, than would be expected under 
the Poisson model. There is thus evidence that some pilots are more accident-prone 
than would occur under the Poisson model. ■
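
The goodness-of-fit statistics in (10.5) and (10.6) can be sketched in code as follows. This illustration starts from the rounded observed proportions in the table rather than from the raw counts, so it reproduces the reported statistics only approximately; the variable names are assumptions made for the sketch.

```python
import math

n = 17952
lam = 0.40597  # estimated Poisson mean given in the text
# Observed proportions for 0, 1, 2, 3, 4, and 5+ accidents (rounded, as tabulated).
p_obs = [0.6949, 0.2293, 0.0566, 0.0150, 0.0030, 0.0012]

# Poisson probabilities for 0..4 accidents; the last cell collects 5 or more.
p_exp = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(5)]
p_exp.append(1 - sum(p_exp))

# Statistics (10.5) and (10.6), written in terms of proportions.
x2 = n * sum((po - pe) ** 2 / pe for po, pe in zip(p_obs, p_exp))
g2 = 2 * n * sum(po * math.log(po / pe) for po, pe in zip(p_obs, p_exp))

# k - 1 df for k cells, minus one more for the estimated Poisson mean.
df = len(p_obs) - 1 - 1
print(round(x2), round(g2), df)
```

The Pearson statistic is especially sensitive to rounding here because the expected probability in the 5+ cell is tiny, so small changes in that cell's observed proportion move $X^2$ noticeably. Either way, both statistics are far out in the tail of a $\chi^2$ distribution with 4 df.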


All of the chi-square test statistics in (10.2), (10.3), (10.5), and (10.6) grow 
with n. If the null hypothesis is not exactly true in the population—if households 
with cable are even infinitesimally more likely to own a personal computer than 
households without cable—we can almost guarantee rejection of the null hypothe- 
sis by taking a large enough random sample. This property of the hypothesis test 
means that it will be sensitive to artificially inflating the sample size by ignoring 
clustering. 


10.2 


Effects of Survey Design on Chi-Square Tests 


The survey design can affect both the estimated cell probabilities and the tests of 
association or goodness of fit. In complex survey designs, we no longer have the 
random sampling that gives both $X^2$ and $G^2$ an approximate $\chi^2$ distribution. Thus, if 
we ignore the survey design and use the chi-square tests described in Section 10.1, 
the significance levels and p-values will be wrong. Clustering, especially, can have a 
strong effect on the p-values of chi-square tests. In a cluster sample with a positive 
intraclass correlation coefficient (ICC), the true p-value will often be much larger than 
the p-value reported by a statistical package using the assumption of independent 
multinomial sampling. Let’s see what can happen to hypothesis tests if the survey 
design is ignored in a cluster sample. 


EXAMPLE 10.4 Suppose that both husband and wife are asked about the household’s cable and computer 
status for the survey discussed in Example 10.1, and both give the same answer. 
While the assumptions of multinomial sampling were met for the SRS of couples, 
they are not met for the cluster sample of persons: far from being independent units, 
the husband and wife from the same household agree completely in their answers. 
The ICC for the cluster sample is 1. 

What happens if we ignore the clustering? The contingency table for the observed 
frequencies is as follows: 


                          Computer?
Observed Count          Yes      No     Total
Cable?    Yes           238     376      614
          No            176     210      386
          Total         414     586     1000


The estimated proportions and odds ratio are identical to those in Example 10.1: 
$\hat{p}_{11} = 238/1000 = 119/500$ and the odds ratio is 

\[ \frac{238/376}{176/210} = 0.755. \]


But $X^2 = 4.562$ and $G^2 = 4.550$ are twice the values of the test statistics in Example 10.1. 
If you ignored the clustering and compared these statistics to a $\chi^2$ distribution 
with 1 df, you would report a “p-value” of 0.033 and conclude that the data provided 
evidence that having a computer and subscribing to cable are not independent. If 
playing this game, you could lower the “p-value” even more by interviewing both 
children in each household as well, thus multiplying the original test statistics by 4. 
Can you attain an arbitrarily low p-value by observing more ssus per psu? Absolutely 
not. The statistics $X^2$ and $G^2$ have a null $\chi^2_1$ distribution when multinomial 
sampling is used. When a cluster sample is taken instead, and when the intraclass 
correlation coefficient is positive, $X^2$ and $G^2$ do not follow a $\chi^2_1$ distribution under 
the null hypothesis. For the 1000 husbands and wives, $X^2/2$ and $G^2/2$ follow a $\chi^2_1$ 
distribution under $H_0$; this gives the same p-value found in Example 10.1. ■
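
The effect described in this example is easy to demonstrate numerically. The sketch below doubles every count from Example 10.1, as happens when each spouse is treated as an independent observation, and compares the naive p-value with the one obtained after dividing the statistic by the cluster size. The helper names are illustrative assumptions.

```python
import math

def pearson_x2(table):
    """Pearson's X2 for an r x c table of counts."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            total += (obs - exp) ** 2 / exp
    return total

def p_value_1df(stat):
    """Upper tail of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(stat / 2))

couples = [[119, 188], [88, 105]]
# Husband and wife give identical answers, so every count doubles.
persons = [[2 * x for x in row] for row in couples]

x2_couples = pearson_x2(couples)
x2_persons = pearson_x2(persons)
naive_p = p_value_1df(x2_persons)          # ignores the clustering
corrected_p = p_value_1df(x2_persons / 2)  # divides by the cluster size of 2
print(round(x2_persons, 3), round(naive_p, 3), round(corrected_p, 2))
```

Dividing by 2 works here only because the ICC is exactly 1 and every cluster has two members; in a real survey the appropriate correction depends on the design effects, as discussed in the rest of the chapter.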


10.2.1 Contingency Tables for Data from Complex Surveys 


The observed counts $x_{ij}$ do not necessarily reflect the relative frequencies of the 
categories in the population unless the sample is self-weighting. Suppose an SRS of 
elementary school classrooms in Denver is taken, and each of ten randomly selected 
students in each classroom is evaluated for self-concept (high or low) and clini- 
cal depression (present or not). Students are selected for the sample with unequal 
probabilities—students in small classes are more likely to be in the sample than stu- 
dents from large classes. A table of observed counts from the sample, ignoring the 
inclusion probabilities, would not give an accurate picture of the association between 
self-concept and depression in the population if the degree of association differs with 
class size. Even if the association between self-concept and depression is the same 
for different class sizes, the estimates of numbers of depressed students using the 
margins of the contingency table may be wrong. 

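Remember, though, that sampling weights can be used to estimate any population 
quantity. Here, they can be used to estimate the cell proportions. Estimate $p_{ij}$ by 

\[ \hat{p}_{ij} = \frac{\sum_{k \in S} w_k y_{kij}}{\sum_{k \in S} w_k}, \tag{10.7} \]

where 

\[ y_{kij} = \begin{cases} 1 & \text{if observation unit } k \text{ is in cell } (i,j) \\ 0 & \text{otherwise} \end{cases} \]

and $w_k$ is the weight for observation unit $k$. Thus, 

\[ \hat{p}_{ij} = \frac{\text{sum of weights for observation units in cell } (i,j)}{\text{sum of weights for all observation units in sample}}. \]

If the sample is self-weighting, $\hat{p}_{ij}$ will be the proportion of observation units falling 
in cell $(i,j)$. Using the estimates $\hat{p}_{ij}$, construct the table 

\[
\begin{array}{cc|cccc|c}
 & & \multicolumn{4}{c|}{C} & \\
 & & 1 & 2 & \cdots & c & \\ \hline
 & 1 & \hat{p}_{11} & \hat{p}_{12} & \cdots & \hat{p}_{1c} & \hat{p}_{1+} \\
R & 2 & \hat{p}_{21} & \hat{p}_{22} & \cdots & \hat{p}_{2c} & \hat{p}_{2+} \\
 & \vdots & \vdots & \vdots & & \vdots & \vdots \\
 & r & \hat{p}_{r1} & \hat{p}_{r2} & \cdots & \hat{p}_{rc} & \hat{p}_{r+} \\ \hline
 & & \hat{p}_{+1} & \hat{p}_{+2} & \cdots & \hat{p}_{+c} & 1
\end{array}
\]

to examine associations, and estimate odds ratios by $(\hat{p}_{ij}\hat{p}_{kl})/(\hat{p}_{il}\hat{p}_{kj})$. A CI for $p_{ij}$ may 
be constructed by using any method of variance estimation discussed so far, or a 
design effect (deff) may be used to modify the SRS CI, as in (7.8). 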
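
A minimal sketch of the weighted estimator in (10.7) follows. The data are entirely hypothetical (five observation units with made-up weights), and the function name `weighted_cell_proportions` is an assumption for illustration.

```python
def weighted_cell_proportions(records, n_rows, n_cols):
    """Estimate cell proportions as in (10.7).

    records: iterable of (weight, row_index, col_index), one per observation unit.
    """
    cell_weight = [[0.0] * n_cols for _ in range(n_rows)]
    total_weight = 0.0
    for w, i, j in records:
        cell_weight[i][j] += w   # numerator: sum of weights in cell (i, j)
        total_weight += w        # denominator: sum of all weights
    return [[cw / total_weight for cw in row] for row in cell_weight]

# Hypothetical sample: (weight, self-concept 0=high/1=low, depression 0=absent/1=present).
sample = [(10.0, 0, 0), (10.0, 0, 0), (30.0, 0, 1), (30.0, 1, 1), (20.0, 1, 0)]
p_hat = weighted_cell_proportions(sample, 2, 2)

# Estimated odds ratio from the weighted proportions.
odds_ratio = (p_hat[0][0] * p_hat[1][1]) / (p_hat[0][1] * p_hat[1][0])
print(p_hat, odds_ratio)
```

In this made-up sample, two of the five units fall in cell (0, 0), so the unweighted proportion would be 0.4, while the weighted estimate is 20/100 = 0.2. The weights alone give point estimates; standard errors still require the design information.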




Do not throw the observed counts away, however. If the odds ratios calculated 
using the $\hat{p}_{ij}$ differ appreciably from the odds ratios calculated using the observed 
counts $x_{ij}$, you should explore why they differ. Perhaps the odds ratio for depression 
and self-concept differs for larger classes or depends on socioeconomic factors related 
to class size. If that is the case, you should include these other factors in a model for 
the data or perhaps test the association separately for large and small classes. 


10.2.2 Effects on Hypothesis Tests and Confidence Intervals 


We can estimate contingency table proportions and odds ratios using weights. The 
weights, however, are not sufficient for constructing hypothesis tests and CIs—these 
depend on the clustering and (sometimes) stratification of the survey design. 

Let’s look at the effect of stratification first. If the strata in a stratified random 
sample are the row categories, the stratification poses no problem—we essentially 
have product-multinomial sampling as described in Section 10.1 and can test for 
homogeneity of proportions the usual way. 

Often, though, we want to study association between factors that are not stratification 
variables. In general, stratification increases precision of the estimates. For an 
SRS, (10.2) gives 

\[ X^2 = n \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j})^2}{\hat{p}_{i+}\hat{p}_{+j}}. \]

A stratified sample with $n$ observation units provides the same precision for estimating 
$p_{ij}$ as an SRS with $n/d_{ij}$ observation units, where $d_{ij}$ is the deff for estimating $p_{ij}$. If 
the stratification is worthwhile, the deffs will generally be less than 1. Consequently, 
if we use the SRS test statistics in (10.2) or (10.3) with the $\hat{p}_{ij}$ from the stratified 
sample, $X^2$ and $G^2$ will be smaller than they should be to follow a null $\chi^2_{(r-1)(c-1)}$ 
distribution; “p-values” calculated ignoring the stratification will be too large and 
$H_0$ will not be rejected as often as it should be. Thus, while SAS PROC FREQ or 
another statistics program for non-survey data may give you a p-value of 0.04, the 
actual p-value may be 0.02. Ignoring the stratification results in a conservative test. 
Similarly, a CI constructed for a log odds ratio is generally too large if the stratification 
is ignored. Your estimates are really more precise than the SRS CI indicates. 

Clustering usually has the opposite effect. Design effects for $\hat{p}_{ij}$ with a cluster 
sample are usually greater than 1: a cluster sample with $n$ observation units gives the 
same precision as an SRS with fewer than $n$ observations. If the clustering is ignored, 
$X^2$ and $G^2$ are expected to be larger than if the equivalently sized SRS were taken, and 
“p-values” calculated ignoring the clustering are likely to be too small. An analysis 
ignoring the survey design may give you a p-value of 0.04, while the actual p-value 
may be 0.25. If you ignore the clustering, you may well declare an association to be 
statistically significant when it is really just due to random variation in the data. CIs 
for log odds ratios will be narrower than they should be; the estimates are not as 
precise as the CIs from an SRS-based analysis would lead you to believe. 

Ignoring clustering in chi-square tests is often more dangerous than ignoring stratification. 
An SRS-based chi-square test using stratified data will still indicate strong 
associations; it just will not uncover all weaker associations. Ignoring clustering, 
however, will lead to declaring associations statistically significant that really are not. 
Ignoring the clustering in goodness-of-fit tests may lead to adopting an unnecessarily 
complicated model to describe the data. 

An investigator ignorant of sampling theory will often analyze a stratified sample 
correctly, using the strata as one of the classification variables. But the investigator may 
not even record the clustering, and too often simply runs the observed counts through 
SAS PROC FREQ or SPSS CROSSTABS and accepts the printed-out p-value as truth. 
To see how this could happen, consider an investigator wanting to replicate Basow 
and Silberg’s (1987) study on whether male and female professors are evaluated 
differently by college students. (The original study was discussed in Example 5.1.) 
The investigator selects a stratified sample of male and female professors at the college 
and asks each student in those professors’ classes to evaluate the professor’s teaching. 
Over 2000 student responses are obtained, and the investigator cross-classifies those 
responses by professor gender and by whether the student gives the professor a high or 
low rating. The investigator, comparing Pearson’s $X^2$ statistic on the observed counts 
to a $\chi^2_1$ distribution, declares a statistically significant association between professor 
gender and student rating. The stratification variable professor gender is one of the 
classification variables, so no adjustments need be made for the stratification. But 
the reported p-value is almost certainly incorrect, for a number of reasons: (1) The 
clustering of students within a class is ignored—indeed, the investigator does not 
even record which professor is evaluated by a student, but only records the professor’s 
gender, so the investigation cannot account for the clustering. If student evaluations 
reflect teaching quality, students of a “good” professor would be expected to give 
higher ratings than students of a “bad” professor. The ICC for students is positive, 
and the equivalent sample size in an SRS is less than 2000. The p-value reported by the 
investigator is then much too small, and the investigator may be wrong in concluding 
faculty women receive a different mean level on student evaluations. (2) A number of 
students may give responses for more than one professor in the sample. It is unclear 
what effect these multiple responses would have on the test of independence. (3) Not 
all students attend class or turn in the evaluation. Some of the nonresponse may be 
missing completely at random (a student was ill the day of the study), but some may 
be related to perceived teaching quality (the student skips class because the professor 
is confusing). 

The societal implications of reporting false positive results because clustering 
is ignored can be expensive. A university administrator may decide to give female 
faculty an unnecessary handicap when determining raises that are based in part on 
student evaluations. A medical researcher may conclude that a new medication with 
more side effects than the standard treatment is more effective for combating a disease, 
even though the statistical significance is due to the cluster inflation of the sample 
size. A government official may decide that a new social program is needed to remedy 
an “inequity” demonstrated in the hypothesis test. The same problem occurs outside 
of sample surveys as well, particularly in biostatistics. Clusters may correspond to 
pairs of eyes, to patients in the same hospital, or to repeated measures on the same 
person. 

Is the clustering problem serious in surveys taken in practice? A number of studies 
have found that it can be. Holt et al. (1980) found that the actual significance levels for 
tests nominally conducted at the $\alpha = 0.05$ level ranged from 0.05 to 0.50. Fay (1985) 
references a number of studies demonstrating that the SRS-based test statistics “may 
give extremely erroneous results when applied to data arising from a complex sample 
design.” The simulation study in Thomas et al. (1996) calculated actual significance 
levels attained for $X^2$ and $G^2$ when the nominal significance level was set at $\alpha = 0.05$; 
they found actual significance levels of about 0.30 to 0.40. 


10.3 


Corrections to $\chi^2$ Tests 


In this section, we outline some of the basic approaches for testing independence with 
data from a complex survey. The theory for goodness of fit tests and tests for homo- 
geneity of proportions is similar. In complex surveys, though, unlike in multinomial 
and product multinomial sampling, the tests for independence and homogeneity of 
proportions are not necessarily the same. Holt et al. (1980) note that often (but not 
always) clustering has less effect on tests for independence than on tests for goodness 
of fit or homogeneity of proportions. 
Recall from (10.1) that the null hypothesis of independence is 

\[ H_0: p_{ij} = p_{i+} p_{+j} \quad \text{for } i = 1, \ldots, r \text{ and } j = 1, \ldots, c. \]

For a $2 \times 2$ table, $p_{ij} = p_{i+}p_{+j}$ for all $i$ and $j$ is equivalent to $p_{11}p_{22} - p_{12}p_{21} = 0$, so 
the null hypothesis reduces to a single equation. In general, the null hypothesis can 
be expressed as $(r-1)(c-1)$ distinct equations, which leads to $(r-1)(c-1)$ df for 
the $\chi^2$ tests used for multinomial sampling. Let 

\[ \theta_{ij} = p_{ij} - p_{i+} p_{+j}. \]

Then, the null hypothesis of independence is 

\[ H_0: \theta_{11} = 0,\ \theta_{12} = 0,\ \ldots,\ \theta_{(r-1)(c-1)} = 0. \]


10.3.1 Wald Tests 


The Wald (1943) test was the first to be used for testing independence in complex 
surveys (Koch et al., 1975). For the $2 \times 2$ table, the null hypothesis involves one 
quantity, 

\[ \theta = \theta_{11} = p_{11} - p_{1+}p_{+1} = p_{11}p_{22} - p_{12}p_{21}, \]

and $\theta$ is estimated by 

\[ \hat{\theta} = \hat{p}_{11}\hat{p}_{22} - \hat{p}_{12}\hat{p}_{21}. \]

The quantity $\theta$ is a smooth function of population totals, so we estimate $V(\hat{\theta})$ using 
one of the methods in Chapter 9. If the sample sizes are sufficiently large and 
$H_0: \theta = 0$ is true, then $\hat{\theta}/\sqrt{\hat{V}(\hat{\theta})}$ approximately follows a standard normal distribution. 
Equivalently, under $H_0$, the Wald statistic 

\[ X_W^2 = \frac{\hat{\theta}^2}{\hat{V}(\hat{\theta})} \tag{10.8} \]

approximately follows a $\chi^2$ distribution with 1 df. In practice, we often compare $X_W^2$ 
to an $F$ distribution with 1 and $\kappa$ df, where $\kappa$ is the df associated with the variance 
estimator. If the random group method is used to estimate the variance, then $\kappa$ equals 
(number of groups) $- 1$; if another method is used, $\kappa$ equals (number of psus) $-$ 
(number of strata). 


EXAMPLE 10.5 Let’s look at the association between variables “Was anyone in your family ever 
incarcerated?” (variable famtime) and “Have you ever been put on probation or sent 
to a correctional institution for a violent offense?” (variable everviol) using data from 
the Survey of Youths in Custody. A total of $n = 2588$ youths in the survey had 
responses for both items. The following table gives the sum of the weights for each 
category. 


                                  Ever Violent?
                               No        Yes       Total
Family Member      No        4,761      7,154     11,915
Incarcerated?      Yes       4,838      7,946     12,784
                   Total     9,599     15,100     24,699


This results in the following table of estimated proportions: 


                                  Ever Violent?
                               No        Yes       Total
Family Member      No        0.1928    0.2896     0.4824
Incarcerated?      Yes       0.1959    0.3217     0.5176
                   Total     0.3886    0.6114     1.0000


Thus, 

\[ \hat{\theta} = \hat{p}_{11}\hat{p}_{22} - \hat{p}_{12}\hat{p}_{21} = \hat{p}_{11} - \hat{p}_{1+}\hat{p}_{+1} = 0.0053. \]

We can write $\theta = h(p_{11}, p_{12}, p_{21}, p_{22})$ and $\hat{\theta} = h(\hat{p}_{11}, \hat{p}_{12}, \hat{p}_{21}, \hat{p}_{22})$ for $h(a, b, c, d) = 
ad - bc$, so we can use linearization (see Exercise 14) or a resampling method to 
estimate $V(\hat{\theta})$. The random group method can also be used, although it does not give 
variance estimates that are as accurate as the other methods (see Exercise 5). Using 
linearization in SAS PROC SURVEYFREQ (SAS code is on the website), we obtain 
$X_W^2 = (0.0053)^2/\hat{V}(\hat{\theta}) = 0.995$ with p-value = 0.32. 

This test gives no evidence of an association between the two factors when we 
look at the population as a whole. But of course the hypothesis test does not say 
anything about possible associations between the two variables in subpopulations. It 
could occur, for example, that violence and incarceration of a family member are 
positively associated among older youth, and negatively associated among younger 
youth; we would need to look at the subpopulations separately or fit a loglinear 
model to see if this was the case. ■
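
The point estimate in this example follows directly from the table of weight sums, as the sketch below shows. The variance $\hat{V}(\hat{\theta})$ cannot be computed from the table alone, since it requires the stratification and clustering information, so the code takes it as an input. The p-value helper uses a $\chi^2_1$ reference distribution, which is a large-$\kappa$ approximation to the $F_{1,\kappa}$ distribution; that substitution and the function names are assumptions of this sketch.

```python
import math

# Sums of weights from Example 10.5.
# Rows: family member incarcerated (no, yes); columns: ever violent (no, yes).
weight_sums = [[4761, 7154], [4838, 7946]]
total = sum(map(sum, weight_sums))
p = [[w / total for w in row] for row in weight_sums]

# theta_hat = p11 * p22 - p12 * p21
theta_hat = p[0][0] * p[1][1] - p[0][1] * p[1][0]

def wald_statistic(theta, var_theta):
    """X_W^2 = theta^2 / V(theta); var_theta must come from a design-based estimator."""
    return theta ** 2 / var_theta

def p_value_chi2_1df(stat):
    # Large-kappa approximation: chi-square with 1 df in place of F(1, kappa).
    return math.erfc(math.sqrt(stat / 2))

p_reported = p_value_chi2_1df(0.995)  # Wald statistic reported in the text
print(round(theta_hat, 4), round(p_reported, 2))
```

With a large number of psus the two reference distributions give nearly the same p-value, which is why the $\chi^2_1$ approximation agrees with the reported value here.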




For larger tables, let $\theta_{ij} = p_{ij} - p_{i+}p_{+j}$ and let $\boldsymbol{\theta} = [\theta_{11}\ \theta_{12}\ \ldots\ \theta_{(r-1)(c-1)}]^T$ be the 
$(r-1)(c-1)$-vector of $\theta_{ij}$'s, so that the null hypothesis is 

\[ H_0: \boldsymbol{\theta} = \mathbf{0}. \]

The Wald statistic is then 

\[ X_W^2 = \hat{\boldsymbol{\theta}}^T \hat{V}(\hat{\boldsymbol{\theta}})^{-1} \hat{\boldsymbol{\theta}}, \]

where $\hat{V}(\hat{\boldsymbol{\theta}})$ is the estimated covariance matrix of $\hat{\boldsymbol{\theta}}$. In very large samples, $X_W^2$ approximately 
follows a $\chi^2_{(r-1)(c-1)}$ distribution under $H_0$. But “large” in a complex survey 
refers to a large number of psus, not necessarily to a large number of observation 
units. In a $4 \times 4$ contingency table, $\hat{V}(\hat{\boldsymbol{\theta}})$ is a $9 \times 9$ matrix, and requires calculation 
of 45 different variances and covariances. If a cluster sample has only 50 psus, the 
estimated covariance matrix will be very unstable. In practice, the Wald test for large 
contingency tables often performs poorly, and we do not recommend its use. Some 
modifications of the Wald test perform better; see Thomas et al. (1996). 


Thomas (1989) suggested using the Bonferroni correction for larger tables. In an 
$R \times C$ table the null hypothesis of independence, 

\[ H_0: \theta_{11} = 0,\ \theta_{12} = 0,\ \ldots,\ \theta_{(r-1)(c-1)} = 0, \]

has $m = (r-1)(c-1)$ components: 

\[
\begin{aligned}
H_0(1) &: \theta_{11} = 0 \\
H_0(2) &: \theta_{12} = 0 \\
 &\ \ \vdots \\
H_0(m) &: \theta_{(r-1)(c-1)} = 0.
\end{aligned}
\]

Instead of using the estimated covariance of all $\hat{\theta}_{ij}$'s as in the full multivariate Wald 
test, we can use the Bonferroni inequality to test each component $H_0(k)$ separately 
with significance level $\alpha/m$. $H_0$ will be rejected at level $\alpha$ if any of the $H_0(k)$ is rejected 
at level $\alpha/m$, that is, if 

\[ \frac{\hat{\theta}_{ij}^2}{\hat{V}(\hat{\theta}_{ij})} > F_{1,\kappa,\alpha/m} \]

for some $i \in \{1, \ldots, r-1\}$ and $j \in \{1, \ldots, c-1\}$. Each test statistic is compared to an 
$F_{1,\kappa}$ distribution, where the estimator of the variance has $\kappa$ df. Since the Bonferroni 
adjustment for multiple testing is used, this is a conservative test, although it appears 
to work well in practice. 
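
The Bonferroni procedure can be sketched as follows. Comparing each statistic to the critical value $F_{1,\kappa,\alpha/m}$ is equivalent to comparing its $F_{1,\kappa}$ p-value to $\alpha/m$, which is what the code does. The tail probability is computed by numerically integrating the $t_\kappa$ density (an $F_{1,\kappa}$ variable is the square of a $t_\kappa$ variable); that numerical approach, the function names, and all input values are assumptions made for illustration, not results from any survey.

```python
import math

def f1_sf(x, kappa, steps=2000):
    """P(F > x) for F with 1 and kappa df, using F = t^2 and Simpson's rule."""
    t = math.sqrt(x)
    log_c = (math.lgamma((kappa + 1) / 2) - math.lgamma(kappa / 2)
             - 0.5 * math.log(kappa * math.pi))

    def pdf(u):
        # Density of a t distribution with kappa df.
        return math.exp(log_c - (kappa + 1) / 2 * math.log1p(u * u / kappa))

    h = t / steps  # steps must be even for Simpson's rule
    s = pdf(0.0) + pdf(t)
    s += sum((4 if k % 2 else 2) * pdf(k * h) for k in range(1, steps))
    return 1 - 2 * s * h / 3

def bonferroni_wald(theta_hat, var_hat, kappa, alpha=0.05):
    """Reject H0 if any component p-value falls below alpha/m."""
    m = len(theta_hat)
    p_values = [f1_sf(t * t / v, kappa) for t, v in zip(theta_hat, var_hat)]
    return any(p < alpha / m for p in p_values), p_values

# Hypothetical estimates for a 3 x 3 table, so m = (3 - 1)(3 - 1) = 4 components.
theta_hat = [0.012, -0.004, 0.020, 0.001]
var_hat = [2.5e-5, 1.6e-5, 3.6e-5, 4.0e-6]
reject, p_values = bonferroni_wald(theta_hat, var_hat, kappa=50)
print(reject, [round(p, 4) for p in p_values])
```

Only the variances of the individual $\hat{\theta}_{ij}$ are needed, not their covariances, which is what makes this approach more stable than the full Wald test when the number of psus is modest.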


10.3.2 Rao-Scott Tests 


The test statistics $X^2$ and $G^2$ do not follow a $\chi^2_{(r-1)(c-1)}$ distribution in a complex 
survey under the null hypothesis of independence. But both statistics have a skewed 
distribution, and a multiple of $X^2$ or $G^2$ may approximately follow a $\chi^2$ distribution. 

We can obtain a first-order correction by matching the mean of the test statistic 
to the mean of the $\chi^2_{(r-1)(c-1)}$ distribution (Rao and Scott, 1981, 1984). The mean of 
a $\chi^2_{(r-1)(c-1)}$ distribution is $(r-1)(c-1)$; we can calculate $E[X^2]$ or $E[G^2]$ under the 
complex sampling design when $H_0$ is true and compare the test statistic 

\[ X_F^2 = \frac{(r-1)(c-1)X^2}{E[X^2]} \]

or 

\[ G_F^2 = \frac{(r-1)(c-1)G^2}{E[G^2]} \]

to a $\chi^2_{(r-1)(c-1)}$ distribution. Bedrick (1983) and Rao and Scott (1984) show that 
under $H_0$, 

\[ E[X^2] \approx E[G^2] \approx \sum_{i=1}^{r}\sum_{j=1}^{c} (1 - p_{ij}) d_{ij} - \sum_{i=1}^{r} (1 - p_{i+}) d_i^R - \sum_{j=1}^{c} (1 - p_{+j}) d_j^C, \tag{10.9} \]

where $d_{ij}$ is the deff for estimating $p_{ij}$, $d_i^R$ is the deff for estimating $p_{i+}$, and $d_j^C$ is 
the deff for estimating $p_{+j}$. In practice, if the estimator of the cell variances has $\kappa$ df, 
it works slightly better to compare $X_F^2/[(r-1)(c-1)]$ or $G_F^2/[(r-1)(c-1)]$ to an $F$ 
distribution with $(r-1)(c-1)$ and $(r-1)(c-1)\kappa$ df. 


The first-order correction can often be used with published tables because you 
need to estimate only variances of the proportions in the contingency table; you need 
not estimate the full covariance matrix of the $\hat{p}_{ij}$ as is required for the Wald test. But 
we are only adjusting the test statistic so that its mean under $H_0$ is $(r-1)(c-1)$; 
p-values of interest come from the tail of the reference distribution, and it does not 
necessarily follow that the tail of the distribution of $X_F^2$ matches the tail of the $\chi^2_{(r-1)(c-1)}$ 
distribution. Rao and Scott (1981) show that $X_F^2$ and $G_F^2$ have a null $\chi^2$ distribution 
if and only if all the deffs for the variances and covariances of the $\hat{p}_{ij}$ are equal. 
Otherwise, the variance of $X_F^2$ is larger than the variance of a $\chi^2_{(r-1)(c-1)}$ distribution, 
and p-values from $X_F^2$ are often a bit smaller than they should be (but closer to the 
actual p-values than if no correction was done at all). 


EXAMPLE 10.6 In the Survey of Youth in Custody, let’s look at the relationship between age and 
whether the youth was sent to the institution for a violent offense (using variable 
crimtype, currviol was defined to be 1 if crimtype = 1 and 0 otherwise). Using the 
weights, we estimate the proportion of the population falling in each cell: 


                                     Age Class
                            ≤15     16 or 17     ≥18      Total
Violent Offense?   No      0.1698    0.2616     0.1275    0.5589
                   Yes     0.1107    0.1851     0.1453    0.4411
                   Total   0.2805    0.4467     0.2728    1.0000


First, let’s look at what happens if we ignore the clustering and pretend that the
test statistic in (10.2) follows a $\chi^2$ distribution with 2 df. With $n = 2621$ youths in
the table, Pearson’s $X^2$ statistic is

$$X^2 = n\sum_{i=1}^{2}\sum_{j=1}^{3}\frac{(\hat{p}_{ij}-\hat{p}_{i+}\hat{p}_{+j})^2}{\hat{p}_{i+}\hat{p}_{+j}} = 34.12.$$

Comparing this to a $\chi^2_2$ distribution yields an incorrect “p-value” of $3.9 \times 10^{-8}$.
The following design effects were estimated, using the stratification and clustering 
information in the survey: 


                                    Age Class
Design Effects            ≤15     16 or 17     ≥18      Total
Violent Offense?   No    14.9       4.0        3.5       6.8
                   Yes    4.7       6.5        3.8       6.8
Total                    14.5       7.5        6.6


Several of the deffs are very large, as might be expected because some facilities have 
mostly violent or mostly nonviolent offenders. All of the residents of facility 31, 
for example, are there for a violent offense. In addition, the facilities with primarily 
nonviolent offenders tend to be larger. We would expect the clustering, then, to have 
a substantial effect on the hypothesis test. 

Using (10.9), we estimate $E[X^2]$ by 4.9 and use $X_F^2 = 2X^2/4.9 = 14.0$. Comparing
14.0/2 to an $F_{2,\infty}$ distribution gives an approximate p-value of 0.001. SAS code for
calculating the Rao-Scott test directly is given on the website. This p-value is probably
still a bit too small, though, because of the wide disparity in the deffs. ■
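The arithmetic in Example 10.6 can be retraced with a short script. This is only a sketch that uses the rounded proportions and deffs printed in the tables above, so its results agree with the text to about one decimal place rather than exactly. The function names are illustrative and are not taken from the book's SAS code. The p-values use the fact that a $\chi^2_2$ variable has survival function $\exp(-x/2)$, which is also the large-denominator-df limit of the $F$ comparison.

```python
import math

# Estimated cell proportions from Example 10.6.
# Rows: violent offense No, Yes. Columns: the three age classes.
p = [[0.1698, 0.2616, 0.1275],
     [0.1107, 0.1851, 0.1453]]
# Estimated design effects for the cells and for the row and column margins.
d_cell = [[14.9, 4.0, 3.5],
          [4.7, 6.5, 3.8]]
d_row = [6.8, 6.8]
d_col = [14.5, 7.5, 6.6]
n = 2621  # number of youths in the table


def margins(p):
    """Row and column marginal proportions of a two-way table."""
    rows = [sum(r) for r in p]
    cols = [sum(c) for c in zip(*p)]
    return rows, cols


def pearson_x2(p, n):
    """Pearson's X^2 computed from estimated proportions, ignoring the design."""
    rows, cols = margins(p)
    total = 0.0
    for i, r in enumerate(p):
        for j, pij in enumerate(r):
            e = rows[i] * cols[j]
            total += (pij - e) ** 2 / e
    return n * total


def expected_x2(p, d_cell, d_row, d_col):
    """Approximate E[X^2] under H0 from (10.9)."""
    rows, cols = margins(p)
    cell_term = sum((1 - p[i][j]) * d_cell[i][j]
                    for i in range(len(p)) for j in range(len(p[0])))
    row_term = sum((1 - pr) * d for pr, d in zip(rows, d_row))
    col_term = sum((1 - pc) * d for pc, d in zip(cols, d_col))
    return cell_term - row_term - col_term


m = (len(p) - 1) * (len(p[0]) - 1)      # (r - 1)(c - 1) = 2
x2 = pearson_x2(p, n)                   # close to 34.12
e_x2 = expected_x2(p, d_cell, d_row, d_col)  # close to 4.9
x2_f = m * x2 / e_x2                    # first-order corrected statistic

# Survival function of a chi-square with 2 df is exp(-x/2); valid only for 2 df.
p_naive = math.exp(-x2 / 2)
p_first_order = math.exp(-x2_f / 2)
```

Because only the cell and margin deffs enter `expected_x2`, the same function could be applied to any published table that reports proportions and deffs.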


Rao and Scott (1981, 1984) also proposed a second-order correction: matching
the mean and variance of the test statistic to the mean and variance of a $\chi^2$ distribution,
as done for ANOVA model tests by Satterthwaite (1946). Satterthwaite compared a
test statistic $T$ with skewed distribution to a $\chi^2$ reference distribution by choosing a
constant $k$ and df $\nu$ so that $E[kT] = \nu$ and $V[kT] = 2\nu$ ($\nu$ and $2\nu$ are the mean and
variance of a $\chi^2$ distribution with $\nu$ df). Here, letting $m = (r-1)(c-1)$, we know
that $E[kX_F^2] = km$ and

$$V[kX_F^2] = V\left[\frac{kmX^2}{E[X^2]}\right] = \frac{V[X^2]\,k^2m^2}{(E[X^2])^2},$$

so matching the moments gives

$$\nu = \frac{2\,(E[X^2])^2}{V[X^2]} \quad\text{and}\quad k = \frac{\nu}{m}.$$

Then,

$$X_S^2 = \frac{\nu X_F^2}{(r-1)(c-1)} \qquad (10.10)$$

is compared to a $\chi^2$ distribution with $\nu$ df. The statistic $G_S^2$ is formed similarly. Again,
if the estimator of the variances of the $\hat{p}_{ij}$ has $\kappa$ df, it works slightly better to compare
$X_S^2/\nu$ or $G_S^2/\nu$ to an $F$ distribution with $\nu$ and $\nu\kappa$ df.
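The moment matching behind (10.10) is short enough to write out as a function. The sketch below assumes that estimates of $E[X^2]$ and $V[X^2]$ under the complex design are already in hand (obtaining $V[X^2]$ is the hard part and is not shown); the function name and argument names are illustrative only.

```python
def second_order_correction(x2, mean_x2, var_x2, m):
    """Satterthwaite-style moment matching for Pearson's X^2.

    x2      : Pearson statistic computed as if sampling were multinomial
    mean_x2 : estimate of E[X^2] under H0 for the complex design
    var_x2  : estimate of V[X^2] under H0 for the complex design
    m       : (r - 1)(c - 1)
    Returns (x2_f, nu, x2_s); x2_s is compared to a chi-square with nu df.
    """
    x2_f = m * x2 / mean_x2            # first-order corrected statistic
    nu = 2.0 * mean_x2 ** 2 / var_x2   # matched degrees of freedom
    k = nu / m                         # scaling constant
    x2_s = k * x2_f                    # equals nu * x2_f / m, as in (10.10)
    return x2_f, nu, x2_s


# Sanity check: if E[X^2] = m and V[X^2] = 2m (the chi-square moments),
# nothing changes: nu = m and X_S^2 = X_F^2 = X^2.
unchanged = second_order_correction(10.0, 2.0, 4.0, 2)

# Hypothetical illustration: the same mean but twice the variance halves nu
# and halves the corrected statistic.
inflated = second_order_correction(10.0, 2.0, 8.0, 2)
```

When the variance of $X^2$ exceeds that of the reference $\chi^2$ distribution, $\nu$ falls below $m$, which is how the second-order correction guards against p-values that are too small.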

In general, estimating $V[X^2]$ is somewhat involved, and requires the complete
covariance matrix of the $\hat{p}_{ij}$’s, so the second-order correction often cannot be used
when the data are only available in published tables. If the deffs are all similar, the first-
and second-order corrections will behave similarly. When the deffs vary appreciably,
however, p-values using $X_F^2$ may be too small, and $X_S^2$ may perform better. Exercise 18
tells how the second-order correction can be calculated.
10.3.3 Model-Based Methods for $\chi^2$ Tests

All the methods above use the covariance estimates of the proportions to adjust the $\chi^2$
tests. A model-based approach may also be used. We describe a model due to Cohen
(1976) for a cluster sample with two observation units per cluster. Extensions and
other models that have been used for cluster sampling are described in Altham (1976),
Brier (1980), Rao and Scott (1981), Wilson and Koehler (1991), and Chowdhury and
McGilchrist (2001). Many of these models assume that the deff is the same for each
cell and margin.

EXAMPLE 10.7 Cohen (1976) presents an example exploring the relationship between gender and
diagnosis with schizophrenia. The data consisted of 71 hospitalized pairs of siblings. 
Many mental illnesses tend to run in families, so we might expect that if one sibling 
is diagnosed with schizophrenia, the other sibling is more likely to be diagnosed with 
schizophrenia. Thus, any analysis that ignores the dependence among siblings is likely 
to give p-values that are much too small. If we just categorize the 142 patients by 
gender and diagnosis and ignore the correlation between siblings, we get the following 
table. Here, S means the patient was diagnosed with schizophrenia, and N means the 
patient was not diagnosed with schizophrenia. 


              S      N     Total
Male         43     15       58
Female       32     52       84
Total        75     67      142

If analyzed using the assumption of multinomial sampling, $X^2 = 17.89$ and $G^2 = 18.46$.
Such an analysis, however, assumes that all the observations are independent,
so the “p-value” of 0.00002 is incorrect.

We know the clustering structure for the 71 clusters, though. You can see in 
Table 10.1 that most of the pairs fall in the diagonal blocks: If one sibling has 
schizophrenia, the other is more likely to have it. In 52 of the sibling pairs, either 
both siblings are diagnosed as having schizophrenia, or both siblings are diagnosed 
as not having schizophrenia. 
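The individual-level counts and the naive test statistics can be recovered from the pair counts in Table 10.1, as the sketch below shows. It assumes the table entries as reconstructed from the printed row and column totals (the first row is 13, 5, 1, 3), and the variable names are illustrative.

```python
import math

labels = ["SM", "SF", "NM", "NF"]
# Table 10.1: rows are the older sibling's class, columns the younger sibling's.
# The first row is reconstructed from the printed row and column totals.
pairs = [[13, 5, 1, 3],
         [4, 6, 1, 1],
         [1, 1, 2, 4],
         [3, 8, 3, 15]]

# Count individuals in each class by adding older-sibling and younger-sibling totals.
older = [sum(r) for r in pairs]
younger = [sum(c) for c in zip(*pairs)]
indiv = {lab: o + y for lab, o, y in zip(labels, older, younger)}

# 2 x 2 table of gender (rows: male, female) by diagnosis (columns: S, N).
counts = [[indiv["SM"], indiv["NM"]],
          [indiv["SF"], indiv["NF"]]]

n = sum(sum(r) for r in counts)
row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]

x2 = 0.0  # Pearson statistic
g2 = 0.0  # likelihood ratio statistic
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n
        x2 += (counts[i][j] - expected) ** 2 / expected
        g2 += 2 * counts[i][j] * math.log(counts[i][j] / expected)

# Pairs in which both siblings have the same diagnosis (the diagonal 2 x 2 blocks).
concordant = sum(pairs[i][j] for i in range(4) for j in range(4)
                 if (i < 2) == (j < 2))
```

Both statistics treat the 142 patients as independent, which is exactly the assumption the example calls into question.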

Let $q_{ij}$ be the probability that a pair falls in the $(i,j)$ cell in the classification of
the pairs. Thus, $q_{11}$ is the probability that both siblings are schizophrenic and male,
$q_{12}$ is the probability that the younger sibling is a schizophrenic female and the older
sibling is a schizophrenic male, etc. Then model the $q_{ij}$’s by

$$q_{ij} = \begin{cases} a\,q_i + (1-a)\,q_i^2 & \text{if } i = j \\ (1-a)\,q_i q_j & \text{if } i \neq j \end{cases} \qquad (10.11)$$

where $a$ is a clustering effect and $q_i$ is the probability that an individual is in class
$i$ ($i$ = SM, SF, NM, NF). If $a = 0$, members of a pair are independent, and we
can just do the regular chi-square test using the individuals: the usual Pearson’s $X^2$,
calculated ignoring the clustering, would be compared to a $\chi^2_{(r-1)(c-1)}$ distribution. If
$a = 1$, the two siblings are perfectly correlated so we essentially have only one piece
of information from each pair, and $X^2/2$ would be compared to a $\chi^2_{(r-1)(c-1)}$ distribution.


TABLE 10.1
Cluster Information for the 71 Pairs of Siblings

                        Younger Sibling
                   SM     SF     NM     NF    Total
Older      SM      13      5      1      3      22
Sibling    SF       4      6      1      1      12
           NM       1      1      2      4       8
           NF       3      8      3     15      29
Total              21     20      7     23      71

For $a$ between 0 and 1, if the model holds, $X^2/(1 + a)$ approximately follows a
$\chi^2_{(r-1)(c-1)}$ distribution if the null hypothesis is true.
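To see how the clustering effect $a$ in (10.11) moves probability onto the diagonal of the pair table, the model can be evaluated directly. The sketch below uses made-up class probabilities and a made-up value of $a$ purely for illustration; it is not Cohen's fitting procedure, only the probability model itself.

```python
def pair_probabilities(q, a):
    """Cell probabilities q_ij for pairs under model (10.11).

    q : list of individual class probabilities (should sum to 1)
    a : clustering effect between 0 and 1
    """
    k = len(q)
    table = []
    for i in range(k):
        row = []
        for j in range(k):
            value = (1 - a) * q[i] * q[j]
            if i == j:
                value += a * q[i]   # extra mass for pairs in the same class
            row.append(value)
        table.append(row)
    return table


# Hypothetical inputs, chosen only to illustrate the model.
q_example = [0.3, 0.2, 0.1, 0.4]
a_example = 0.3
table = pair_probabilities(q_example, a_example)

total = sum(sum(r) for r in table)        # the q_ij sum to 1
row_margins = [sum(r) for r in table]     # each row margin equals q_i

independent = pair_probabilities(q_example, 0.0)  # q_ij = q_i * q_j
identical = pair_probabilities(q_example, 1.0)    # all mass on the diagonal
```

Whatever the value of $a$, the margins of the pair table stay equal to the individual class probabilities; only the agreement between siblings changes.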

The model may be fit by maximum likelihood (see Cohen, 1976, for details). Then,
$\hat{a} = 0.3006$, and the estimated probabilities for the four cells are the following:

               S         N       Total
Male        0.2923    0.1112    0.4035
Female      0.2330    0.3636    0.5966
Total       0.5253    0.4748    1.0000

We can check the model by using a goodness-of-fit test for the clustered data in
Table 10.1. This model does not exhibit significant lack of fit, while the model assuming
independence does. For testing whether gender and schizophrenia are independent
in the 2 × 2 table, $X^2/1.3006 = 13.76$, which we compare to a $\chi^2_1$ distribution. The
resulting p-value is 0.0002, about 10 times as large as the p-value from the analysis
that pretended siblings were independent.
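The two p-values quoted in this example can be checked without statistical software, because a $\chi^2$ variable with 1 df is the square of a standard normal variable, so that $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$. The sketch below takes $X^2 = 17.89$ and $\hat{a} = 0.3006$ from the text as given; the helper name is illustrative.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


x2 = 17.89        # Pearson statistic ignoring the pairing (from the text)
a_hat = 0.3006    # maximum likelihood estimate of the clustering effect (from the text)

x2_adjusted = x2 / (1 + a_hat)        # model-based adjustment
p_naive = chi2_sf_1df(x2)             # roughly 0.00002
p_adjusted = chi2_sf_1df(x2_adjusted) # roughly 0.0002
ratio = p_adjusted / p_naive          # about an order of magnitude
```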


10.4 Loglinear Models


If there are more than two classification variables, we are often interested in seeing 
if there are more complex relationships in the data. Loglinear models are commonly 
used to study these relationships. 


10.4.1 Loglinear Models with Multinomial Sampling


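In a two-way table, if the row variable and the column variable are independent, then
$p_{ij} = p_{i+}p_{+j}$. Equivalently,

$$\ln p_{ij} = \ln(p_{i+}) + \ln(p_{+j}) = \mu + \alpha_i + \beta_j,$$

where

$$\sum_{i=1}^{r}\alpha_i = 0 \quad\text{and}\quad \sum_{j=1}^{c}\beta_j = 0.$$

This is called a loglinear model because the logarithms of the cell probabilities follow
a linear model. The model for independence in a 2 × 2 table may be written as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta},$$

where

$$\mathbf{y} = \begin{bmatrix} \ln(p_{11}) \\ \ln(p_{12}) \\ \ln(p_{21}) \\ \ln(p_{22}) \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & -1 & -1 \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \mu \\ \alpha_1 \\ \beta_1 \end{bmatrix}.$$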

The parameters $\boldsymbol{\beta}$ are estimated using the estimated probabilities $\hat{p}_{ij}$. For the data in
Example 10.1, the estimated probabilities are as follows:

                      Computer?
                    Yes       No      Total
Cable?    Yes      0.238    0.376     0.614
          No       0.176    0.210     0.386
Total              0.414    0.586     1.000


The parameter estimates are $\hat{\mu} = -1.428$, $\hat{\alpha}_1 = 0.232$, and $\hat{\beta}_1 = -0.174$. The
fitted values of $p_{ij}$ for the model of independence are then

$$\tilde{p}_{ij} = \exp(\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j)$$

and are given in the following table:

                      Computer?
                    Yes       No      Total
Cable?    Yes      0.254    0.360     0.614
          No       0.160    0.226     0.386
Total              0.414    0.586     1.000
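The parameter estimates and fitted values for the independence model can be reproduced from the table of estimated probabilities. The sketch below relies on the fact that, under independence, the fitted cell probabilities are the products of the estimated margins, and then solves for $\mu$, $\alpha_1$, and $\beta_1$ using the sum-to-zero constraints. Variable names are illustrative.

```python
import math

# Estimated probabilities from Example 10.1.
# Rows: cable yes, no. Columns: computer yes, no.
p_hat = [[0.238, 0.376],
         [0.176, 0.210]]

row = [sum(r) for r in p_hat]          # estimated p_{i+}
col = [sum(c) for c in zip(*p_hat)]    # estimated p_{+j}

# Log of the fitted probabilities under independence: ln(p_{i+}) + ln(p_{+j}).
log_fit = [[math.log(row[i] * col[j]) for j in range(2)] for i in range(2)]

# With sum-to-zero constraints, mu is the overall mean of the logs, and
# alpha_1 and beta_1 are the first row and column means minus mu.
mu = sum(sum(r) for r in log_fit) / 4
alpha1 = (log_fit[0][0] + log_fit[0][1]) / 2 - mu
beta1 = (log_fit[0][0] + log_fit[1][0]) / 2 - mu

# Rebuild the fitted table from the parameters; alpha_2 = -alpha_1, beta_2 = -beta_1.
signs = [1, -1]
fitted = [[math.exp(mu + signs[i] * alpha1 + signs[j] * beta1) for j in range(2)]
          for i in range(2)]
```

The signs in `signs` correspond to the second and third columns of the matrix $\mathbf{X}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$.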


We would also like to see how well this model fits the data. We can do that in two 
ways: 


1 Test the goodness of fit of the model using either $X^2$ in (10.5) or $G^2$ in (10.6):
For a two-way contingency table, these statistics are equivalent to the statistics for
testing independence. For the computer/cable example, the likelihood ratio statistic
for goodness of fit is 2.27. In multinomial sampling, $X^2$ and $G^2$ approximately
follow a $\chi^2_{(r-1)(c-1)}$ distribution if the model is correct.

2 A full, or saturated, model for the data can be written as

$$\log p_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

with $\sum_{i=1}^{r}(\alpha\beta)_{ij} = \sum_{j=1}^{c}(\alpha\beta)_{ij} = 0$. The last term is analogous to the interaction
term in a two-way ANOVA model. This model will give a perfect fit to the observed
cell probabilities because it has $rc$ parameters. The null hypothesis of independence
is equivalent to

$$H_0: (\alpha\beta)_{ij} = 0 \text{ for } i = 1, \ldots, r-1;\ j = 1, \ldots, c-1.$$

Standard statistical software packages give estimates of the $(\alpha\beta)_{ij}$’s and their
asymptotic standard errors under multinomial sampling. For the saturated model
in the computer/cable example, SAS PROC CATMOD gives the following:

                                       Standard      Chi-
Effect        Parameter   Estimate      Error       Square     Prob
CABLE             1         0.2211      0.0465       22.59    0.0000
COMP              2        -0.1585      0.0465       11.61    0.0007
CABLE*COMP        3        -0.0702      0.0465        2.28    0.1313

The values in the column “Chi-Square” are the Wald test statistics for testing
whether that parameter is zero. Thus the p-value, under multinomial sampling, for
testing whether the interaction term is zero is 0.1313; again, for this example,
this is exactly the same as the p-value from the test for independence.
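In a 2 × 2 table the saturated-model parameters are simple contrasts of the log probabilities, so the estimates in the output above can be checked by hand. The sketch below computes them from the estimated probabilities of Example 10.1 and forms the Wald statistic for the interaction using the standard error 0.0465 quoted in the output; that standard error is taken as given, not recomputed, and is valid only under multinomial sampling.

```python
import math

# Estimated probabilities from Example 10.1.
# Rows: cable yes, no. Columns: computer yes, no.
p_hat = [[0.238, 0.376],
         [0.176, 0.210]]
logs = [[math.log(v) for v in r] for r in p_hat]

# Sum-to-zero contrasts of the four log probabilities.
alpha1 = (logs[0][0] + logs[0][1] - logs[1][0] - logs[1][1]) / 4   # CABLE
beta1 = (logs[0][0] - logs[0][1] + logs[1][0] - logs[1][1]) / 4    # COMP
ab11 = (logs[0][0] - logs[0][1] - logs[1][0] + logs[1][1]) / 4     # CABLE*COMP

se_quoted = 0.0465                       # standard error quoted in the output
wald = (ab11 / se_quoted) ** 2           # Wald chi-square for the interaction
p_value = math.erfc(math.sqrt(wald / 2)) # upper tail of a chi-square with 1 df
```

Note that `ab11` is one quarter of the log odds ratio, so testing whether the interaction is zero is the same as testing whether the odds ratio is 1.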


10.4.2 Loglinear Models in a Complex Survey


What happens in a complex survey? We obtain point estimates of the model parameters
as we always do, by using weights. Thus, we estimate $p_{ij}$ by (10.7), and calculate
pseudo-maximum likelihood estimates of the loglinear model parameters incorporating
the weights. If we use software that does not account for the survey design,
however, the test statistics for goodness of fit and the asymptotic standard errors for
the parameter estimates will be wrong.

Many of the same corrections used for $\chi^2$ tests of independence can also be used
for hypothesis tests in loglinear models. Rao and Thomas (2003) describe various
tests of goodness of fit for contingency tables from complex surveys; these include
Wald tests, jackknife, and Rao-Scott corrections to $X^2$ and $G^2$.

The Bonferroni inequality may also be used to compare nested loglinear models.
For testing independence in a two-way table, for example, we compare the saturated
model with the reduced model of independence and test each of the $m = (r-1)(c-1)$
null hypotheses

$$H_0(1): (\alpha\beta)_{11} = 0$$
$$\vdots$$
$$H_0(m): (\alpha\beta)_{(r-1)(c-1)} = 0$$

separately at level $\alpha/m$.

More generally, we can compare any two nested loglinear models using this
method. For a three-dimensional $r \times c \times d$ table, let

$$\mathbf{y} = [\ln(p_{111}), \ln(p_{112}), \ldots, \ln(p_{rcd})]^T.$$

Suppose the smaller model is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$$

and the larger model is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\theta},$$

where $\boldsymbol{\theta}$ is a vector of length $m$. Then we can fit the larger model, and perform $m$
separate hypothesis tests of the null hypotheses

$$H_0: \theta_i = 0,$$

each at level $\alpha/m$, by comparing $\hat{\theta}_i/\mathrm{SE}(\hat{\theta}_i)$ to a $t$ distribution.


EXAMPLE 10.8 Let’s look at a three-dimensional table from the Survey of Youths in Custody, to
examine relationships among the variables “Was anyone in your family ever incarcerated?”
“Have you ever been put on probation or sent to a correctional institution for a
violent offense?” and age. The cell probabilities are $p_{ijk}$. The estimated probabilities
$\hat{p}_{ijk}$, estimated using weights, are in the following table:

                          Family Member Incarcerated?
                          No                     Yes
                     Ever Violent?          Ever Violent?
                     No        Yes          No        Yes       Total
Age      ≤15       0.0588    0.0698       0.0659    0.0856     0.2801
Class    16-17     0.0904    0.1237       0.0944    0.1375     0.4461
         ≥18       0.0435    0.0962       0.0355    0.0986     0.2738
Total              0.1928    0.2896       0.1959    0.3217     1.0000

The saturated model for the three-way table is

$$\log p_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}.$$

SAS PROC CATMOD, using the weights, gives the following parameter estimates
for the saturated model:

                                                    Standard       Chi-
Effect                      Parameter   Estimate     Error        Square     Prob
AGECLASS                        1       -0.1149     0.00980       137.45    0.0000
                                2        0.3441     0.00884      1515.52    0.0000
EVERVIOL                        3       -0.2446     0.00685      1275.26    0.0000
AGECLASS*EVERVIOL               4        0.1366     0.00980       194.27    0.0000
                                5        0.0724     0.00884        67.04    0.0000
FAMTIME                         6        0.0242     0.00685        12.51    0.0004
AGECLASS*FAMTIME                7        0.0555     0.00980        32.03    0.0000
                                8        0.0128     0.00884         2.10    0.1473
EVERVIOL*FAMTIME                9       -0.0317     0.00685        21.42    0.0000
AGECLAS*EVERVIOL*FAMTIME       10        0.00888    0.00980         0.82    0.3646
                               11        0.0161     0.00884         3.33    0.0680
Because this is a complex survey, and because SAS PROC CATMOD acts as though
the sample size is $\sum w_i$ when the weights are used, the standard errors and p-values
given for the parameters are completely wrong. But we can estimate the variance of
each parameter by refitting the loglinear model on each of the random groups, and use
the random group estimate of the variance to perform hypothesis tests on individual
parameters. The random group standard errors for the 11 parameters are:

Parameter    Estimate    Standard error    Test statistic
    1        -0.1149         0.1709            -0.67
    2         0.3441         0.0953             3.61
    3        -0.2446         0.0589            -4.15
    4         0.1366         0.0769             1.78
    5         0.0724         0.0379             1.91
    6         0.0242         0.0273             0.89
    7         0.0555         0.0191             2.91
    8         0.0128         0.0218             0.59
    9        -0.0317         0.0233            -1.36
   10         0.0089         0.0191             0.47
   11         0.0161         0.0167             0.96

The null hypothesis of no interactions among variables is

$$H_0: (\alpha\beta)_{ij} = (\alpha\gamma)_{ik} = (\beta\gamma)_{jk} = (\alpha\beta\gamma)_{ijk} = 0;$$

or, using the parameter numbering in the output,

$$H_0: \beta_4 = \beta_5 = \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = 0.$$

This null hypothesis has seven components; to use the Bonferroni test, we test each
individual parameter at the 0.05/7 level. The $(1 - 0.05/14)$ percentile of a $t_6$ distribution
is 4.0; none of the test statistics $\hat{\beta}_i/\mathrm{SE}(\hat{\beta}_i)$, for $i = 4, 5, 7, 8, 9, 10, 11$, exceeds that
critical value, so we would not reject the null hypothesis that all three variables are
independent. We might want to explore the ageclass × famtime interaction further,
however. ■
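The Bonferroni comparison in Example 10.8 amounts to a few divisions and one comparison per parameter. The sketch below uses the estimates and random group standard errors from the table above, and takes the critical value 4.0 for the $t_6$ distribution from the text rather than computing it. The dictionary and variable names are illustrative.

```python
# Estimates and random group standard errors from Example 10.8, by parameter number.
estimates = {1: -0.1149, 2: 0.3441, 3: -0.2446, 4: 0.1366, 5: 0.0724, 6: 0.0242,
             7: 0.0555, 8: 0.0128, 9: -0.0317, 10: 0.0089, 11: 0.0161}
std_errors = {1: 0.1709, 2: 0.0953, 3: 0.0589, 4: 0.0769, 5: 0.0379, 6: 0.0273,
              7: 0.0191, 8: 0.0218, 9: 0.0233, 10: 0.0191, 11: 0.0167}

# Interaction parameters that make up the null hypothesis of independence.
interaction_params = [4, 5, 7, 8, 9, 10, 11]

alpha = 0.05
per_test_level = alpha / len(interaction_params)  # each test is run at level 0.05/7

# Two-sided critical value of a t distribution with 6 df at level 0.05/7,
# as quoted in the text.
critical_value = 4.0

t_stats = {i: estimates[i] / std_errors[i] for i in estimates}
rejected = [i for i in interaction_params if abs(t_stats[i]) > critical_value]
largest = max(interaction_params, key=lambda i: abs(t_stats[i]))
```

The list `rejected` comes out empty, matching the conclusion in the text, and `largest` identifies parameter 7 (an ageclass × famtime term) as the interaction with the largest test statistic. Parameter 3 does exceed the critical value in absolute value, but it is a main effect and is not part of the null hypothesis being tested.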


10.5 Chapter Summary


Since many surveys collect categorical data, we often want to perform chi-square tests 
to explore association among variables. We can estimate probabilities in contingency 
tables using the sampling weights. Pearson and likelihood-ratio chi-square tests for 
association must be modified to account for the stratification and clustering in the 
survey design. Wald tests use the design-based variance so that the Wald test statistic 
approximately follows a chi-square distribution. The Rao-Scott test modifies the
usual Pearson or likelihood-ratio test statistics by the average design effect to obtain 
corrected p-values. 
Key Terms

Loglinear model: A model used for associations in categorical data.

Rao-Scott correction: A modification to a chi-square test statistic to account for
the complex survey design.

Wald test statistic: A statistic for testing $H_0: \theta = \theta_0$ of the form
$(\hat{\theta} - \theta_0)^T[\hat{V}(\hat{\theta})]^{-1}(\hat{\theta} - \theta_0)$, in which the variance estimator accounts for the complex survey design.

For Further Reading

Agresti (2002) and Simonoff (2006) are good references on the analysis of categorical
data in non-survey situations. The books edited by Skinner et al. (1989a) and
Chambers and Skinner (2003) contain chapters on categorical data analysis on complex
survey data.

A number of methods have been proposed to account for the survey design when
testing for goodness of fit, homogeneity of populations, and independence of variables.
Thomas et al. (1996) describe more than 25 methods that have been developed for
testing independence in two-way tables and provide a useful bibliography. Some of
these methods, and variations, are described in more detail in Rao and Thomas (1988,
1989, 2003). Fay (1985) describes an alternative method that involves jackknifing
the test statistic itself. Scott (2007) reviews Rao-Scott corrections and outlines other
areas of application.

Exercises

A. Introductory Exercises


1 Find an example or exercise in an introductory statistics textbook that performs a
chi-square test on data from a survey. What design do you think was used for the
survey? Is a chi-square test for multinomial sampling appropriate for the data? Why,
or why not?


2 Read one of the articles listed in the file chapter10papers.html on the book website,
or another research article in which a categorical data analysis is performed on survey
data. Describe the sampling design and the method of analysis. Did the authors account
for the design in their data analysis? Should they have analyzed the data differently?


3 Schei and Bakketeig (1989) took an SRS of 150 women between 20 and 49 years of
age from Trondheim, Norway. Their goal was to investigate the relationship between
sexual and physical abuse by a spouse and certain gynecological symptoms in the
women. Of the 150 women selected to be in the sample, 15 had moved, 1 had died,
3 were excluded because they were not eligible for the study, and 13 refused to
participate.

Of the 118 women who participated in the study, 20 reported some type of sexual or
physical abuse from their spouse: eight reported being hit, two being kicked or bitten,
seven being beaten up, and three being threatened or cut with a knife. Seventeen of
the women in the study reported a gynecological symptom of irregular bleeding or
pelvic pain. The numbers of women falling into the four categories of gynecological
symptom and abuse by spouse are given in the following contingency table:

                                        Abuse
                                      No     Yes    Total
Gynecological               No        89      12     101
Symptom Present?            Yes        9       8      17
Total                                 98      20     118

a If abuse and presence of gynecological symptoms are not associated, what are the
expected probabilities in each of the four cells?

b Perform a $\chi^2$ test of association for the variables abuse and presence of gynecological
symptoms.

c What is the response rate for this study? Which definition of response rate did
you use? Do you think that the nonresponse might affect the conclusions of the
study? Explain.


4 Samuels (1996) collected data to examine how well students do in follow-up courses
if the prerequisite course is taught by a part-time or full-time instructor. The following
table gives results for students in Math I and Math II.

Instructor     Instructor             Grade in Math II
for Math I     for Math II     A, B, or C    D, F, or Withdraw    Total
Full Time      Full Time          797              461             1258
Full Time      Part Time          311              181              492
Part Time      Full Time          570              480             1050
Part Time      Part Time          909              449             1358
Total                            2587             1571             4158

a The null hypothesis here is that the proportion of students receiving an A, B, or
C is the same for each of the four combinations of instructor type. Is this a test of
independence, homogeneity, or goodness of fit?

b Perform a hypothesis test for the null hypothesis in (a), assuming students are
independent.

c Do you think the assumption that students are independent is valid? Explain.


B. Working with Survey Data 


5 In Example 10.5 we used linearization to estimate $V(\hat{\theta})$. We can alternatively use the
random group method to estimate the variance of $\hat{\theta}$. Calculate $\hat{\theta} = \hat{p}_{11}\hat{p}_{22} - \hat{p}_{12}\hat{p}_{21}$ for
each of the seven random groups, as discussed in Example 9.4, and find the variance
of the seven nearly independent estimates of $\theta$. Form the Wald statistic based on your
estimated variance. Since the estimate of the variance from the random groups method
has only six df, the test statistic should be compared to an $F_{1,6}$ distribution rather than
to a $\chi^2_1$ distribution. Do you reach the same conclusion as in Example 10.5?
6 Use the file winter.dat for this exercise. The data were first discussed in Exercise 19
of Chapter 3.

a Test the null hypothesis that class is not associated with breakaga. In the context
of Section 10.1, what type of sampling was done?

b Now construct a 2 × 2 contingency table for the variables breakaga and work.
Use the sampling weights to estimate the probabilities $p_{ij}$ for each cell.

c Calculate the odds ratio using the $\hat{p}_{ij}$ from (b). How does this compare with an odds
ratio calculated using the observed counts (and ignoring the sampling weights)?

d Estimate $\theta = p_{11}p_{22} - p_{21}p_{12}$ using the $\hat{p}_{ij}$ you calculated in (b).

e Test the null hypothesis $H_0: \theta = 0$.

f How did the stratification affect the hypothesis test?

7 Use the file teachers.dat for this exercise. The data were first discussed in Exercise 15
of Chapter 5.

a Construct a new variable zassist, which takes on the value 1 if a teacher’s aide
spends any time assisting the teacher, and 0 otherwise. Construct another new
variable zprep, which takes on values Low, Medium, and High based on the
amount of time the teacher spends in school on preparation.

b Construct a 2 × 3 contingency table for the variables zassist and zprep. Use the
sampling weights to estimate the probabilities $p_{ij}$ for each cell.

c Using the Rao-Scott method, test the null hypothesis that zassist is not associated
with zprep.


8 The following data are from the Canada Health Survey, and given in Rao and Thomas
(1989, p. 107). They relate smoking status (current smoker, occasional smoker, never
smoked) to fitness level for 2505 persons. Smokers who had quit were not included
in the analysis. The estimated proportions in the table below were found by applying
the sample weights to the sample. The design effects are in brackets. We would like
to test whether smoking status and fitness level are independent.

                                    Fitness level
Smoking                          Minimum
Status         Recommended       acceptable      Unacceptable     Total
Current        0.220 [3.50]     0.150 [4.59]     0.170 [1.50]     0.540 [1.44]
Occasional     0.023 [3.45]     0.010 [1.07]     0.011 [1.09]     0.044 [2.32]
Never          0.203 [3.49]     0.099 [2.07]     0.114 [1.51]     0.416 [2.44]
Total          0.446 [4.69]     0.259 [5.96]     0.295 [1.71]     1.000

a What is the value of $X^2$ if you assume the 2505 observations were collected
in a multinomial sample? Of $G^2$? What is the p-value for each statistic under
multinomial sampling, and why are these p-values incorrect?

b Using (10.9), find the approximate expected value of $X^2$ and $G^2$.

c Calculate the corrected statistics $X_F^2$ and $G_F^2$ for these data, and find p-values for
the hypothesis tests. Does the clustering in the Canada Health Survey make a
difference in the p-value you obtain?


9 The following data are from Rao and Thomas (1988), and were collected in the
Canadian Class Structure Survey, a stratified multistage sample collected in 1982-83
to study employment and social structure. Canada was divided into 35 strata by region
and population size; two psus were sampled in 34 of the strata, and one psu sampled
in the 35th stratum. Variances were estimated using balanced repeated replication
using the 34 strata with two psus. Estimated design effects are in brackets behind the
estimated proportion for each cell.

                               Males            Females          Total
Decision-making managers    0.103 [1.20]     0.038 [1.31]     0.141 [1.09]
Advisor-managers            0.018 [0.74]     0.016 [1.95]     0.034 [1.95]
Supervisors                 0.075 [1.81]     0.043 [0.92]     0.118 [1.30]
Semi-autonomous workers     0.105 [0.71]     0.085 [1.85]     0.190 [1.44]
Workers                     0.239 [1.42]     0.278 [1.15]     0.516 [1.86]
Total                       0.540 [1.29]     0.460 [1.29]

a What is the value of $X^2$ if you assume the 1463 persons were surveyed in a simple
random sample? Of $G^2$? What is the p-value for each statistic under multinomial
sampling, and why are these p-values incorrect?

b Using (10.9), find the approximate expected value of $X^2$ and $G^2$.

c How many df are associated with the BRR variance estimates?

d Calculate the first-order corrected statistics $X_F^2$ and $G_F^2$ for these data, and find
approximate p-values for the hypothesis tests. Does the clustering in the survey
make a difference in the p-value you obtain?

e The second-order Rao-Scott correction gave test statistic $X_S^2 = 38.4$, with 3.07 df.
How does the p-value obtained using $X_S^2$ compare with the p-value from $X_F^2$?


10 Using the data in syc.dat, define the variable currprop as 1 if crimtype = 2 and 0
otherwise. Perform a Rao-Scott test of whether currprop is associated with age group,
for the groups given in Example 10.6. Also give the table of estimated probabilities
for the cross-classification.

11 Using the NHANES data in nhanes.dat, categorize the respondents in three groups:
normal, with body mass index (bmxbmi) less than 25; overweight, with 25 ≤ bmxbmi
< 30; and obese, with bmxbmi ≥ 30. Also create two age groups, of persons under age
30 and persons at least 30 years old. Create a table of the estimated probabilities for
the cross-classification of these two categorical variables, and perform a Rao-Scott
test of association.

12 Using the data in ncvs2000.dat, test whether being a victim of violent crime is associated
with gender.


C. Working with Theory 


13 Some researchers have used the following method to perform tests of association in
two-way tables. Instead of using the original observation weights $w_k$, define

$$w_k^* = \frac{n\,w_k}{\sum_{l \in S} w_l},$$

where $n$ is the number of observation units in the sample. The sum of the new weights
$w_k^*$, then, is $n$. The “observed” count for cell $(i,j)$ is

$$\hat{x}_{ij} = \text{sum of the } w_k^* \text{ for observations in cell } (i,j)$$

and the “expected” count for cell $(i,j)$ is $\hat{m}_{ij} = (\hat{x}_{i+}\hat{x}_{+j})/n$. Then compare the test
statistic

$$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(\hat{x}_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$

to a $\chi^2_{(r-1)(c-1)}$ distribution.
Does this test give correct p-values for data from a complex survey? Why, or why
not? HINT: Try it out on the data in Examples 10.1 and 10.4.


14 (Requires calculus.) Consider $X_W^2$ in (10.8).

a Use the linearization method of Section 9.1 to approximate $V(\hat{\theta})$ in terms of $V(\hat{p}_{ij})$
and $\mathrm{Cov}(\hat{p}_{ij}, \hat{p}_{kl})$. Show that if we let $y_{ijk} = 1$ if observation $k$ is in cell $(i,j)$ and 0
otherwise, then $\hat{V}(\hat{\theta}) = \hat{V}(\hat{\bar{q}})$, where $q_k = \hat{p}_{22}y_{11k} + \hat{p}_{11}y_{22k} - \hat{p}_{12}y_{21k} - \hat{p}_{21}y_{12k}$.

b What is the Wald statistic, using the linearization estimate of $V(\hat{\theta})$ in (a), when
multinomial sampling is used? (Under multinomial sampling, $V(\hat{p}_{ij}) =
p_{ij}(1 - p_{ij})/n$ and $\mathrm{Cov}(\hat{p}_{ij}, \hat{p}_{kl}) = -p_{ij}p_{kl}/n$.) Is this the same as Pearson’s $X^2$
statistic?


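15 (Requires calculus.) Estimating the log odds ratio in a complex survey. Let

$$\theta = \log\left(\frac{p_{11}p_{22}}{p_{12}p_{21}}\right) \quad\text{and}\quad \hat{\theta} = \log\left(\frac{\hat{p}_{11}\hat{p}_{22}}{\hat{p}_{12}\hat{p}_{21}}\right).$$

a Use the linearization method of Section 9.1 to approximate $V(\hat{\theta})$ in terms of $V(\hat{p}_{ij})$
and $\mathrm{Cov}(\hat{p}_{ij}, \hat{p}_{kl})$.

b Using (a), show that

$$\hat{V}(\hat{\theta}) = \frac{1}{x_{11}} + \frac{1}{x_{12}} + \frac{1}{x_{21}} + \frac{1}{x_{22}}$$

under multinomial sampling.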


16 In Section 10.3.1, we used a Wald test for $H_0: \theta = 0$, where $\theta = \theta_{11} = p_{11}p_{22} - p_{12}p_{21}$.
An equivalent null hypothesis is $H_0: \eta = 0$, where $\eta = \log[(p_{11}p_{22})/(p_{12}p_{21})]$. Using
the result of Exercise 15, derive the Wald test statistic for $H_0: \eta = 0$.


17 Show that for multinomial sampling, $X_F^2 = X^2$. Hint: What is $E[X^2]$ in (10.9) for a
multinomial sample?


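18 (Requires mathematical statistics and theory of linear models.) Deriving the first- and
second-order corrections to Pearson’s $X^2$ (see Rao and Scott, 1981).

a Suppose the random vector $\mathbf{Y}$ is normally distributed with mean $\mathbf{0}$ and covariance
matrix $\boldsymbol{\Sigma}$. Then, if $\mathbf{C}$ is symmetric and positive definite, show that $\mathbf{Y}^T\mathbf{C}\mathbf{Y}$ has the
same distribution as $\sum \lambda_i W_i$, where the $W_i$’s are independent $\chi^2_1$ random variables
and the $\lambda_i$’s are the eigenvalues of $\mathbf{C}\boldsymbol{\Sigma}$.

b Let $\hat{\boldsymbol{\theta}} = (\hat{\theta}_{11}, \ldots, \hat{\theta}_{1(c-1)}, \ldots, \hat{\theta}_{(r-1)1}, \ldots, \hat{\theta}_{(r-1)(c-1)})^T$, where $\hat{\theta}_{ij} = \hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j}$.
Let $\mathbf{A}$ be the covariance matrix of $\hat{\boldsymbol{\theta}}$ if a multinomial sample of size $n$ is taken and
the null hypothesis is true. Using (a), argue that $\hat{\boldsymbol{\theta}}^T\mathbf{A}^{-1}\hat{\boldsymbol{\theta}}$ asymptotically has the
same distribution as $\sum \lambda_i W_i$, where the $W_i$ are independent $\chi^2_1$ random variables,
and the $\lambda_i$’s are the eigenvalues of $\mathbf{A}^{-1}V(\hat{\boldsymbol{\theta}})$.

c What are $E[\hat{\boldsymbol{\theta}}^T\mathbf{A}^{-1}\hat{\boldsymbol{\theta}}]$ and $V[\hat{\boldsymbol{\theta}}^T\mathbf{A}^{-1}\hat{\boldsymbol{\theta}}]$ in terms of the $\lambda_i$’s?

d Find $E[\hat{\boldsymbol{\theta}}^T\mathbf{A}^{-1}\hat{\boldsymbol{\theta}}]$ and $V[\hat{\boldsymbol{\theta}}^T\mathbf{A}^{-1}\hat{\boldsymbol{\theta}}]$ for a 2 × 2 table. You may want to use your
answer in Exercise 14.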


19 We know the clustering structure for the data in Example 10.7. Use results from
Chapter 5 (assume one-stage cluster sampling) to estimate the proportion for each
cell and margin in the 2 × 2 table, and find the variance for each estimated proportion.
Now use estimated design effects to perform a hypothesis test of independence using
$X_F^2$. How do the results compare to the model-based test?


D. Projects and Activities 


20 Trucks. Use the VIUS data described in Exercise 34 of Chapter 3. Define the variable
heavy to be 1 if the gross vehicle weight is higher than 10,000 pounds and 0 otherwise,
and define the variable autotran to be 1 if the vehicle has automatic transmission and 0
otherwise. Using the sample weights, construct a 2 × 2 table of estimated probabilities
for the cross-classification of heavy and autotran. What is the design effect for each
estimated proportion? Carry out a Rao-Scott test for independence. How do the results
compare with a Wald test for independence?


21 Baseball data. For the sample you selected in Exercise 37 of Chapter 5, define the
variable pitcher to be 1 if the player is a pitcher and 0 otherwise, and the variable
million to be 1 if the salary is greater than $1 million and 0 otherwise.

a Test whether the variables pitcher and million are associated, using the first-order
Rao-Scott test.

b Using the sampling weights, estimate the log odds ratio.


22 IPUMS exercises. Use your sample from Exercise 28 of Chapter 7 for this problem.

a Create a categorical variable from inctot with two categories: low income and
high income. Use the median income as the dividing point for the categories in
the new variable catinc.

b Conduct hypothesis tests to explore whether catinc is associated with (i) race or
(ii) sex. What method did you use to account for the complex sampling design?


23 Activity for course project. Return to the survey you explored in Exercise 31 of
Chapter 7. Now consider two categorical responses in the survey. Construct a two-way
table of estimated proportions, using the weight variable. Conduct a hypothesis
test to explore whether these variables are associated. What method did you use to
account for the complex sampling design?
EXAMPLE 11.1


Regression with Complex Survey Data


Now he knew that he knew nothing fundamental and, like a lone monk stricken with a conviction of sin,
he mourned, “If I only knew more! ... Yes, and if I could only remember statistics!”

-- Sinclair Lewis, It Can’t Happen Here


How are maternal drug use and smoking related to birth weight and infant mortality? 
What variables are the best predictors of neonatal mortality? How is the birth weight 
of an infant related to that of older siblings? 

In most of this book, we have emphasized estimating population means and 
totals—for example, how many low-birth-weight babies are born in the United States 
each year? Questions on the relation between variables, however, are often answered 
in statistics by using some form of a regression analysis. A response variable (for 
example, birth weight) is related to a number of explanatory variables (for example, 
maternal smoking, family income, and maternal age). We would like to be able to use 
the resulting regression equation not only to identify the relationship among variables 
for our data, but also to predict the value of the response for future infants, or for 
infants not included in the sample. 

You know how to fit regression models if the “usual assumptions,” reviewed in 
Section 11.1, are met. These assumptions are often not met for data from complex 
surveys, however. To answer the questions above, for example, you might want to 
use data from the 1988 Maternal and Infant Health Survey (MIHS) in the United 
States. The survey, collected by the U.S. Census Bureau for the National Center for 
Health Statistics, provides data on a number of factors related to pregnancy and infant 
health, including weight gain, smoking, and drug use during pregnancy; maternal 
exposure to toxic wastes and hazards; and complications during pregnancy and deliv- 
ery (Sanderson et al., 1991). But, like most large-scale surveys, the MIHS is not a 
simple random sample (SRS). Stratified random samples were drawn from the 1988 
vital records from the contiguous 48 states and the District of Columbia. The sam- 
ples included 10,000 certificates of live birth from the 3,909,510 live births in 1988, 
4000 reports of fetal death from the estimated 15,000 fetal deaths of 28 weeks’ or 
more gestation, and 6000 certificates of death for infants under 1 year of age from
the population of 38,910 such deaths. Because black infants have higher incidence of
low birth weight and infant mortality than white infants, black infants had a higher 
sampling fraction than nonblack infants. Low-birth-weight infants were also over- 
sampled. Mothers in the sampled records were mailed a questionnaire asking about 
prenatal care; smoking, drinking, and drug use; family income; hospitalization; health 
of the baby; and a number of other related variables. After receiving permission from 
the mother, investigators also sent questionnaires to the prenatal care providers and 
hospitals, asking about the mother’s and baby’s health before and after birth. ■


As we found for analysis of contingency tables in the previous chapter, unequal 
probabilities of selection and the clustering and stratification of the sample complicate 
a statistical analysis. In the MIHS, the unequal inclusion probabilities for infants in
different strata may need to be considered when fitting regression models. If a survey 
involves clustering, as does the National Crime Victimization Survey (NCVS), then 
standard errors for the regression coefficients calculated under the assumption that 
observations are independent will be incorrect. 

In this chapter, we explore how to do regression in complex sample surveys. We 
review the traditional model-based approach to regression analysis, as taught in intro- 
ductory statistics courses, in Section 11.1. In Section 11.2, we discuss a design-based 
approach to regression, and present methods for calculating standard errors of regres- 
sion coefficients. Section 11.4 contrasts design-based and model-based approaches, 
Section 11.5 discusses a model-based approach, and Section 11.6 applies these ideas 
to logistic regression. 

We already used regression estimation in Chapter 4. In Chapter 4, though, the 
emphasis was on using information in an auxiliary variable to increase the precision 
of the estimate of the population total, $t_y = \sum_{i=1}^{N} y_i$. In Sections 11.1 to 11.6 of this
chapter, our primary interest is in exploring the relation among different variables, 
and thus in estimating the regression coefficients. In Section 11.7, we return to the 
use of regression for improving the precision of estimated totals. 


11.1 Model-Based Regression in Simple Random Samples


As usually exposited in areas of statistics other than sampling, regression inference is 
based on a model that is assumed to describe the relationship between the explanatory 
variable, x, and the response variable, y. The straight-line model commonly used for 
a single explanatory variable is 


$$Y_i \mid x_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \tag{11.1}$$


where $Y_i$ is a random variable for the response, $x_i$ is an explanatory variable, and $\beta_0$
and $\beta_1$ are unknown parameters. The $Y_i$'s are random variables; the data collected in
the sample of size $n$ are one realization of those $n$ random variables, $\{y_i, i \in S\}$. The
$\varepsilon_i$'s, the deviations of the response variable about the line described by the model, are
assumed to satisfy conditions (A1) through (A3):
(A1) $E[\varepsilon_i] = 0$ for all $i$. In other words, $E[Y_i \mid x_i] = \beta_0 + \beta_1 x_i$.

(A2) $V[\varepsilon_i] = \sigma^2$ for all $i$. The variance about the regression line is the same for all
values of $x$.

(A3) $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$. Observations are uncorrelated.


Often, (A4) is also assumed: It implies (A1) through (A3), and adds the additional
assumption of normally distributed $\varepsilon_i$'s.


(A4) Conditionally on the $x_i$'s, the $\varepsilon_i$'s are independent and identically distributed
from a normal distribution with mean 0 and variance $\sigma^2$.


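The ordinary least squares (OLS) estimators of the parameters are the values
$\hat\beta_0$ and $\hat\beta_1$ that minimize the residual sum of squares
$\sum_{i \in S} [y_i - (\hat\beta_0 + \hat\beta_1 x_i)]^2$. Estimators
of the slope $\beta_1$ and intercept $\beta_0$ are obtained by solving the normal equations: For
the model in (11.1), these are


$$\hat\beta_0 \, n + \hat\beta_1 \sum_{i \in S} x_i = \sum_{i \in S} y_i$$
$$\hat\beta_0 \sum_{i \in S} x_i + \hat\beta_1 \sum_{i \in S} x_i^2 = \sum_{i \in S} x_i y_i.$$


Solving the normal equations gives the parameter estimators


$$\hat\beta_1 = \frac{\sum_{i \in S} x_i y_i - \frac{1}{n} \left( \sum_{i \in S} x_i \right) \left( \sum_{i \in S} y_i \right)}{\sum_{i \in S} x_i^2 - \frac{1}{n} \left( \sum_{i \in S} x_i \right)^2}, \tag{11.2}$$
$$\hat\beta_0 = \frac{1}{n} \sum_{i \in S} y_i - \hat\beta_1 \frac{1}{n} \sum_{i \in S} x_i.$$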


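Both $\hat\beta_1$ and $\hat\beta_0$ are linear in $y$, as we can write each in the form $\sum_{i \in S} a_i y_i$ for known
constants $a_i$. Although not usually taught in this form, it is equivalent to (11.2) to
write


$$\hat\beta_1 = \sum_{i \in S} \left[ \frac{x_i - \bar{x}_S}{\sum_{j \in S} (x_j - \bar{x}_S)^2} \right] y_i$$


and


$$\hat\beta_0 = \sum_{i \in S} \left[ \frac{1}{n} - \frac{\bar{x}_S (x_i - \bar{x}_S)}{\sum_{j \in S} (x_j - \bar{x}_S)^2} \right] y_i,$$


where $\bar{x}_S = \frac{1}{n} \sum_{i \in S} x_i$ is the sample mean of the $x_i$'s.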

If assumptions (A1) to (A3) are satisfied, then $\hat\beta_0$ and $\hat\beta_1$ are the best linear unbiased
estimators: among all linear estimators that are unbiased under model (11.1), $\hat\beta_0$ and
$\hat\beta_1$ have the smallest variance. If assumption (A4) is met, we can use the $t$ distribution
to construct confidence intervals (CIs) and hypothesis tests for the slope and intercept
of the “true” regression line. Under assumption (A4),


$$\frac{\hat\beta_1 - \beta_1}{\sqrt{\hat V_M(\hat\beta_1)}}$$


follows a $t$ distribution with $n - 2$ degrees of freedom (df). The subscript $M$ refers
to the use of the model to estimate the variance; for model (11.1), a model-unbiased
estimator of the variance is


$$\hat V_M(\hat\beta_1) = \frac{\sum_{i \in S} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 / (n - 2)}{\sum_{i \in S} (x_i - \bar{x}_S)^2}. \tag{11.3}$$


The coefficient of determination $R^2$ in straight-line regression is


$$R^2 = 1 - \frac{\sum_{i \in S} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sum_{i \in S} (y_i - \bar{y}_S)^2}.$$
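As a rough illustration of (11.2), (11.3), and the formula for $R^2$, the following Python sketch computes the OLS quantities directly from the sums. The function name `ols_line` and the toy numbers are assumptions made for this illustration; they are not taken from anthsrs.dat or from the SAS code on the website.

```python
def ols_line(x, y):
    """Straight-line OLS fit: returns (b0, b1, model-based variance of b1, R^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope and intercept from the normal equations, as in (11.2)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    x_bar, y_bar = sum_x / n, sum_y / n
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    v_model = (sse / (n - 2)) / ss_x            # model-based variance, (11.3)
    r2 = 1 - sse / sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, v_model, r2


# Hypothetical finger lengths (cm) and heights (inches), invented for this sketch
finger = [10.5, 11.0, 11.2, 11.6, 12.0, 12.4]
height = [62.0, 64.5, 64.0, 66.0, 66.5, 68.5]
b0, b1, v_model, r2 = ols_line(finger, height)
print(b0, b1, v_model ** 0.5, r2)   # intercept, slope, standard error of slope, R^2
```

The square root of the third returned value plays the role of the "Standard Error" column that regression software reports for the slope under assumptions (A1) to (A3).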


To illustrate regression in the setting just discussed, we use data from Macdonell 
(1901), giving the length of the left middle finger (cm) and height (inches) for 3000 
criminals. At the end of the nineteenth century, it was widely thought that criminal 
tendencies might also be expressed in physical characteristics that were distinguish- 
able from the physical characteristics of noncriminal classes. Macdonell compared 
means and correlations of anthropometric measurements of the criminals to those 
of Cambridge men (presumed to come from a different class in society). This is an 
important data set in the history of statistics: it is the one Student (1908) used to
demonstrate the $t$ distribution. The entire data set for the 3000 criminals is in the file
anthrop.dat. 

An SRS of 200 individuals (file anthsrs.dat) was taken from the 3000 observations. 
Fitting a straight line model (SAS code is given on the website) with y = height and 
x = (length of left middle finger) results in the following output: 


                  Parameter    Standard
Variable    DF     Estimate       Error    t Value    Pr > |t|
Intercept    1     30.31625     2.56681      11.81      <.0001
finger       1      3.04525     0.22172      13.73      <.0001


The sample data are plotted along with the OLS regression line in Figure 11.1. The
model appears to be a good fit to the data ($R^2 = 0.49$), and, using the model-based
analysis, a 95% CI for the slope of the line is


$$3.0453 \pm 1.972(0.2217) = [2.61, 3.48].$$


If we generated samples of size 200 from the model in (11.1) over and over again and
constructed a CI for the slope for each sample, we would expect 95% of the resulting
CIs to include the true value of $\beta_1$. ■


Here are some remarks relevant to the application of regression to survey data: 


1 No assumptions whatsoever are needed to calculate the estimates $\hat\beta_0$ and $\hat\beta_1$ from
the data; these are simply formulas. The assumptions in (A1) to (A4) are needed
to make inferences about the “true” but unknown parameters $\beta_0$ and $\beta_1$ and about
predicted values of the response variable. So the assumptions are used only when
we construct a CI for $\beta_1$ or for a predicted value, or when we want to say, for
example, that $\hat\beta_1$ is the best linear unbiased estimator of $\beta_1$.


FIGURE 11.1

A plot of height vs. finger length for an SRS of 200 observations. The area of each circle is
proportional to the number of observations at that value of $(x, y)$. The OLS regression line,
drawn in, has equation $y = 30.32 + 3.05x$.


The same holds true for other statistics we calculate. If we take a convenience 
sample of 100 persons, we may always calculate the average of those persons’ 
incomes. But we cannot assess the accuracy of that statistic unless we make model 
assumptions about the population and sample. With a probability sample, however, 
we can use the sample design itself to make inferences and do not need to make 
assumptions about the model. 


2 If the assumptions are not at least approximately satisfied, model-based infer- 
ences about parameters and predicted values will likely be wrong. For example, if 
observations are positively correlated rather than independent, the variance esti- 
mate from (11.3) is likely to be smaller than it should be. Consequently, regression 
coefficients are likely to be deemed statistically significant more often than they 
should be, as demonstrated in Kish and Frankel (1974). 


3 We can partially check the assumptions of the model by plotting the residuals and
using various diagnostic statistics as described in the regression books listed in the 
reference section. One commonly used plot is that of residuals versus predicted 
values, used to check (A1) and (A2). For the data in Example 11.2, this plot is 
shown in Figure 11.2, and gives no indication that the data in the sample violate 
assumptions (A1) or (A2). (This does not mean that the assumptions are true, 
just that we see nothing in the plot to indicate that they do not hold. Some of the 
assumptions, particularly independence, are quite difficult to check in practice.) 
However, we have no way of knowing whether observations not in the sample are 
fit by this model unless we actually see them. 


FIGURE 11.2
A plot of residuals for the model-based analysis of criminal height data, using the SRS plotted
in Figure 11.1. No patterns are apparent.


4 Regression is not limited to variables related by a straight line. Let $y$ be birth
weight, and $x$ take on the value 1 if the mother is black and 0 if the mother is
not black. Then the regression slope estimates the difference in mean birth weight
for black and nonblack mothers, and the test statistic for $H_0 : \beta_1 = 0$ is the
pooled $t$-test statistic for the null hypothesis that the mean birth weight for blacks
is the same as the mean birth weight for nonblacks. Thus comparison of means for
subpopulations, or domains, can be treated as a special case of regression analysis,
as will be seen in Section 11.3.
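The claim in remark 4, that the OLS slope on a 0/1 explanatory variable equals the difference of the two group means, can be checked with a few lines of Python. The birth weights below are invented for the sketch, and `ols_slope` is a hypothetical helper name.

```python
def ols_slope(x, y):
    """OLS slope for a straight-line fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Hypothetical birth weights in grams; x = 1 marks the first group, 0 the second
x = [1, 1, 1, 0, 0, 0, 0]
y = [3000.0, 3100.0, 3200.0, 3300.0, 3400.0, 3500.0, 3600.0]

mean_1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)
mean_0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
slope = ols_slope(x, y)
print(slope, mean_1 - mean_0)   # the two quantities agree up to rounding
```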


11.2 Regression in Complex Surveys


Many investigators performing regression analyses on complex survey data simply 
run the data through standard software for the model in (11.1) and report the parameter 
estimates and standard errors given by the software. One may debate whether to take 
a model-based or design-based approach (and we shall, in Section 11.4), but the data 
structure needs to be taken into account in either approach. 

What can happen in complex surveys? 


1 Observations may have different inclusion probabilities, $\pi_i$. If $\pi_i$ is related to
the response variable $y_i$, then an analysis that does not account for the different
probabilities of selection may lead to biases in the estimated regression parameters. 
This problem is discussed in detail by Nathan and Smith (1989), who give a 
bibliography of related literature. 


For example, suppose that an unequal-probability sample of 200 men is taken from
the population described in Example 11.2 and that the inclusion probabilities are
higher for the shorter men. (For illustration purposes, I used the $y_i$'s to set the
inclusion probabilities, with $\pi_i$ proportional to 24 for $y < 65$, 12 for $y = 65$, 2
for $y = 66$ or 67, and 1 for $y > 67$, with data in file anthuneq.dat.) Figure 11.3
shows a scatterplot of the data from this sample, along with the ordinary least
squares regression line described in Section 11.1. The OLS regression equation is
$y = 43.41 + 1.79x$, compared with the equation $y = 30.32 + 3.05x$ for the SRS
in Example 11.2. Ignoring the inclusion probabilities in this example leads to a
very different estimate of the regression line and distorts the relationship in the
population.


FIGURE 11.3

A plot of $y$ vs. $x$ for an unequal-probability sample of 200 criminals. In this plot, the area of
the circle is proportional to the number of observations at that data point, not to the sum of
weights at the point. The OLS line, ignoring the sampling weights, is $y = 43.41 + 1.79x$. The
smaller slope of this line, when compared to the slope 3.05 for the SRS in Figure 11.1, reflects
the undersampling of tall men. The OLS regression estimators are biased for the population
quantities because they do not incorporate the unequal sampling weights.


Nonrespondents can distort the relationship for much the same reason. If the non- 
respondents in the MIHS are more likely to have low-birth-weight infants, then a 
regression model predicting birth weight from explanatory variables may not fit 
the nonrespondents. Item nonresponse may have similar effects. 


The stratification of the MIHS would also need to be taken into account. The sur- 
vey was stratified because the investigators wanted to be sure to have an adequate 
sample size for blacks and low-birth-weight infants. It is certainly plausible that 
each stratum may have its own regression line, and postulating a single straight 
line to fit all the data may hide some of the information in the data. 


2 Even if the estimators of the regression parameters are approximately design unbiased,
the standard errors given by non-survey regression programs will likely be
wrong if the survey design involves clustering. Usually, with clustering, the design
effect (deff) for regression coefficients will be greater than 1.


11.2.1 Point Estimation


Traditionally, design-based sampling theory has been concerned with estimating
quantities from a finite population, quantities such as $t_y = \sum_{i=1}^{N} y_i$ or $\bar{y}_U = t_y / N$.
In that descriptive spirit, then, the finite population quantities of interest for regression
are the least squares coefficients for the population, $B_0$ and $B_1$, that minimize


$$\sum_{i=1}^{N} (y_i - B_0 - B_1 x_i)^2$$


over the entire finite population. It would be nice if the equation $y = B_0 + B_1 x$
summarizes useful information about the population (otherwise, why are you really
interested in $B_0$ and $B_1$?), but no assumptions are necessary to say that these are the
quantities of interest. As in Section 11.1, the normal equations are


$$B_0 N + B_1 \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i$$
$$B_0 \sum_{i=1}^{N} x_i + B_1 \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i y_i,$$


and $B_0$ and $B_1$ can be expressed as functions of the population totals:


$$B_1 = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right) \left( \sum_{i=1}^{N} y_i \right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2} = \frac{t_{xy} - \frac{t_x t_y}{N}}{t_{x^2} - \frac{(t_x)^2}{N}}, \tag{11.4}$$

$$B_0 = \frac{1}{N} \sum_{i=1}^{N} y_i - B_1 \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{t_y - B_1 t_x}{N}. \tag{11.5}$$


FIGURE 11.4

A plot of the population of 3000 criminals. The area of each circle is proportional to the
number of population observations at those coordinates. The population OLS regression line
is $y = 30.18 + 3.06x$.

We know the values for the entire population for the sample drawn in Exam- 
ple 11.2. These population values are plotted in Figure 11.4, along with the population 
least squares line y = 30.179 + 3.056x. 
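As both $B_0$ and $B_1$ are functions of population totals, we can use methods derived
in earlier chapters to estimate each total separately and then substitute the estimators
into (11.4) and (11.5). We estimate each population total in (11.4) and (11.5) using
weights, with $\hat N = \sum_{i \in S} w_i$, $\hat t_y = \sum_{i \in S} w_i y_i$, $\hat t_x = \sum_{i \in S} w_i x_i$,
$\hat t_{xy} = \sum_{i \in S} w_i x_i y_i$, and $\hat t_{x^2} = \sum_{i \in S} w_i x_i^2$. Then,


$$\hat B_1 = \frac{\sum_{i \in S} w_i x_i y_i - \frac{1}{\sum_{i \in S} w_i} \left( \sum_{i \in S} w_i x_i \right) \left( \sum_{i \in S} w_i y_i \right)}{\sum_{i \in S} w_i x_i^2 - \frac{1}{\sum_{i \in S} w_i} \left( \sum_{i \in S} w_i x_i \right)^2} \tag{11.6}$$


and


$$\hat B_0 = \frac{\sum_{i \in S} w_i y_i - \hat B_1 \sum_{i \in S} w_i x_i}{\sum_{i \in S} w_i}. \tag{11.7}$$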


Computational Note Although (11.6) and (11.7) are correct expressions for the esti- 
mators, they are subject to roundoff error and are not as good for computation as 
other algorithms that have been developed. In practice, you should use professional 
software designed for estimating regression parameters in complex surveys. If you do 
not have access to such software, you can use any statistical regression package that 
calculates weighted least squares estimates. If you use weights w; in the weighted least 
squares estimation, you will obtain the same point estimates as in (11.6) and (11.7); 
however, in complex surveys, the standard errors and hypothesis tests the software 
provides will be incorrect and should be ignored. 
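One way to limit the roundoff mentioned in the Computational Note is to center the variables at their weighted means before forming sums; the result is algebraically identical to (11.6) and (11.7). The Python sketch below does this. The function name `weighted_line`, the data, and the weights are all assumptions for illustration, and the sketch returns point estimates only.

```python
def weighted_line(x, y, w):
    """Survey-weighted intercept and slope, algebraically equal to (11.7) and (11.6)."""
    n_hat = sum(w)                                          # estimated population size
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat    # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat    # weighted mean of y
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample of four units with unequal sampling weights
x = [11.0, 11.4, 11.8, 12.2]
y = [63.0, 65.0, 67.0, 70.0]
w = [1.0, 2.0, 12.0, 24.0]
print(weighted_line(x, y, w))            # uses the sampling weights
print(weighted_line(x, y, [1.0] * 4))    # ignores them, which is the OLS fit
```

With integer weights, calling `weighted_line` on a data set in which each unit is repeated $w_i$ times with weight 1 gives the same two numbers, which is the sense in which each sampled unit stands in for $w_i$ population units.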


Plotting the Data In any regression analysis, you must plot the data. Plotting mul- 
tivariate data is challenging even for data from an SRS (Cook and Weisberg, 1994, 
discuss regression graphics in depth). Data from a complex survey design—with strat- 
ification, unequal weights, and clustering—have even more features to incorporate 
into plots. Some bivariate plots for survey data are discussed in Section 7.4.2. In 
Figure 11.5, we indicate the weighting by circle area. 


Let’s estimate the finite population quantities $B_0$ and $B_1$ for the unequal-probability
sample plotted in Figure 11.3. The point estimates, using the weights, are $\hat B_0 = 30.19$
and $\hat B_1 = 3.05$. If we ignored the weights and simply ran the observed data through a
standard regression program such as SAS PROC REG, we would get very different estimates:
$\hat\beta_0 = 43.41$ and $\hat\beta_1 = 1.79$, the values in Figure 11.3.

Figure 11.5 shows why the weights, which were related to $y$, make a difference
here. Taller men had lower inclusion probabilities and thus not as many of them
appeared in the unequal-probability sample. However, the taller men that were selected
had higher sampling weights; a 69-inch man in the sample represented 24 times
as many population units as a 60-inch man in the sample. When the weights are
incorporated, estimates of the parameters are computed as though there were actually
$w_i$ data points with values $(x_i, y_i)$. ■


FIGURE 11.5

A plot of data from an unequal-probability sample. The area of each circle is proportional to
the sum of the weights for observations with that value of $x$ and $y$. Note that the taller men in
the sample also have larger weights, so the slope of the regression line using weights is drawn
upward. The regression line, calculated with the weights, is $y = 30.19 + 3.05x$.


11.2.2 Standard Errors


Let’s now examine the effect of the complex sampling design on the standard errors.
As $\hat B_0$ and $\hat B_1$ are functions of estimated population totals, methods from Chapter 9
may be used to calculate variance estimates.

For any method of estimating the variance, under certain regularity conditions an
approximate $100(1 - \alpha)\%$ CI for $B_1$ is given by


$$\hat B_1 \pm t_{\alpha/2} \sqrt{\hat V(\hat B_1)},$$


where $t_{\alpha/2}$ is the upper $\alpha/2$ point of a $t$ distribution with df associated with the variance
estimate. For linearization, jackknife, or balanced repeated replication (BRR) in a
stratified multistage sample, we generally use (number of sampled psus) − (number
of strata) as the df.


11.2.2.1 Standard Errors Using Linearization 


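The linearization variance estimator for the slope may be used because $B_1$ is a function
of five population totals: from (11.4), $B_1 = h(t_{xy}, t_x, t_y, t_{x^2}, N)$, where


$$h(a, b, c, d, e) = \frac{a - bc/e}{d - b^2/e} = \frac{ea - bc}{ed - b^2}.$$


Using linearization, then, as you showed in Exercise 14 from Chapter 9,


$$V(\hat B_1) \approx V\left[ \frac{\partial h}{\partial a} (\hat t_{xy} - t_{xy}) + \frac{\partial h}{\partial b} (\hat t_x - t_x) + \frac{\partial h}{\partial c} (\hat t_y - t_y) + \frac{\partial h}{\partial d} (\hat t_{x^2} - t_{x^2}) + \frac{\partial h}{\partial e} (\hat N - N) \right]$$

$$= V\left[ \left( t_{x^2} - \frac{(t_x)^2}{N} \right)^{-1} \sum_{i \in S} w_i (y_i - B_0 - B_1 x_i)(x_i - \bar{x}_U) \right].$$


Define


$$q_i = (y_i - \hat B_0 - \hat B_1 x_i)(x_i - \hat{\bar{x}}),$$


where $\hat{\bar{x}} = \hat t_x / \hat N$. Then, we may use


$$\hat V_L(\hat B_1) = \frac{\hat V\left( \sum_{i \in S} w_i q_i \right)}{\left[ \sum_{i \in S} w_i x_i^2 - \frac{\left( \sum_{i \in S} w_i x_i \right)^2}{\sum_{i \in S} w_i} \right]^2} \tag{11.8}$$


to estimate the variance of $\hat B_1$.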
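To make (11.8) concrete, here is a minimal Python sketch for the simplest design: one stratum, each observation treated as its own psu, and the total of the $w_i q_i$ estimated with the usual with-replacement variance formula (no finite population correction). Those design assumptions, the function name `linearization_var_slope`, and the toy data are mine; survey software handles strata and clusters, which this sketch does not.

```python
def linearization_var_slope(x, y, w):
    """Return (B1_hat, V_L(B1_hat)) following (11.6) and (11.8).

    Assumes one stratum, every observation its own psu, and a
    with-replacement variance estimator for the total of w_i * q_i.
    """
    n = len(x)
    n_hat = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    # s_xx equals sum(w x^2) - (sum(w x))^2 / sum(w), the bracket in (11.8)
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    # z_i = w_i * q_i, with q_i = (y_i - b0 - b1 x_i)(x_i - x_bar)
    z = [wi * (yi - b0 - b1 * xi) * (xi - x_bar) for wi, xi, yi in zip(w, x, y)]
    z_bar = sum(z) / n
    v_total = n / (n - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return b1, v_total / s_xx ** 2


# Hypothetical sample with unequal weights
x = [11.0, 11.4, 11.8, 12.2, 12.6]
y = [63.0, 65.5, 66.0, 69.0, 70.5]
w = [24.0, 12.0, 2.0, 2.0, 1.0]
b1, v_lin = linearization_var_slope(x, y, w)
print(b1, v_lin ** 0.5)   # slope estimate and its linearization standard error
```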
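Note that the design-based variance estimator in (11.8) differs from the model-based
variance estimator in (11.3), even if an SRS is taken. In an SRS of size $n$,


$$\hat V\left( \sum_{i \in S} w_i q_i \right) = \hat V(\hat t_q) = \left( 1 - \frac{n}{N} \right) N^2 \frac{s_q^2}{n}$$


with


$$s_q^2 = \frac{1}{n - 1} \sum_{i \in S} (x_i - \bar{x}_S)^2 (y_i - \hat B_0 - \hat B_1 x_i)^2$$


(the sample mean of the $q_i$ is 0 because of the normal equations). Thus, for an SRS, (11.8) gives


$$\hat V_L(\hat B_1) = \left( 1 - \frac{n}{N} \right) \frac{n \sum_{i \in S} (x_i - \bar{x}_S)^2 (y_i - \hat B_0 - \hat B_1 x_i)^2}{(n - 1) \left[ \sum_{i \in S} (x_i - \bar{x}_S)^2 \right]^2},$$


but from (11.3),


$$\hat V_M(\hat\beta_1) = \frac{\sum_{i \in S} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{(n - 2) \sum_{i \in S} (x_i - \bar{x}_S)^2}.$$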


Why the difference? The design-based estimator of the variance $\hat V_L$ comes from
the inclusion probabilities of the design, while $\hat V_M$ comes from the average squared
deviation over all possible realizations of the model. CIs constructed from the two
variance estimates have different interpretations. With the design-based CI,
$\hat B_1 \pm t_{\alpha/2} \sqrt{\hat V_L(\hat B_1)}$, the confidence level is $\sum u(S) P(S)$, where the sum is over all possible
samples $S$ that can be selected using the sampling design, $P(S)$ is the probability that
sample $S$ is selected, and $u(S) = 1$ if the CI constructed from sample $S$ contains
the population characteristic $B_1$ and $u(S) = 0$ otherwise. In an SRS, the design-based
confidence level is the proportion of possible samples that result in a CI that
includes $B_1$, from the set of all SRSs of size $n$ from the finite population of fixed
values $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.


For the model-based CI $\hat\beta_1 \pm t_{\alpha/2} \sqrt{\hat V_M(\hat\beta_1)}$, the confidence level is the expected
proportion of CIs that will include $\beta_1$, from the set of all samples that could be
generated from the model in (A1) to (A4). Thus the model-based estimator assumes
that (A1) to (A4) hold for the infinite population mechanism that generates the data.
The SRS design of the sample makes assumption (A3) (uncorrelated observations)
reasonable. If a straight line model describes the relation between $x$ and $y$, then (A1)
is also plausible. A violation of assumption (A2) (equal variances), however, can have
a large effect on inferences. The linearization design-based estimator of the variance
is more robust to violations of assumption (A2), as explored in Exercise 18.


For the SRS in Example 11.2, the model-based and design-based estimates of the variance
are quite similar, as the model assumptions appear to be met for the sample and
population. For these data, $\hat B_1 = \hat\beta_1$ because $w_i = 3000/200$ for all $i$; $\hat V_L(\hat B_1) = 0.048$
and $\hat V_M(\hat\beta_1) = (0.2217)^2 = 0.049$. In other situations, however, the estimates of
the variance can be quite different; usually, if there is a difference, the linearization
estimate of the variance is larger than the model-based estimate of the variance
because the linearization estimate in (11.8) is valid whether the model is “correct”
or not.

For the unequal-probability sample of 200 criminals in Example 11.3, define the
new variable


$$q_i = (y_i - \hat B_0 - \hat B_1 x_i)(x_i - \hat{\bar{x}}) = (y_i - 30.1859 - 3.0541 x_i)(x_i - 11.51359).$$


(Note that $\hat{\bar{x}} = 11.51359$ is the estimate of $\bar{x}_U$ calculated using the unequal probabilities;
the sample average of the 200 $x_i$'s in the sample is 11.2475, which is quite a bit
smaller.) Then $\hat V\left( \sum_{i \in S} w_i q_i \right) = 238{,}161$, and


$$\left[ \sum_{i \in S} w_i x_i^2 - \frac{\left( \sum_{i \in S} w_i x_i \right)^2}{\sum_{i \in S} w_i} \right]^2 = 688{,}508,$$


so $\hat V_L(\hat B_1) = 0.35$. If the weights are ignored, then the ordinary least squares analysis
gives $\hat\beta_1 = 1.79$ and $\hat V_M(\hat\beta_1) = 0.05$. The estimated variance is much smaller using
the model, but $\hat\beta_1$ is biased as an estimator of $B_1$. Since an unequal-probability sample
was taken, $\hat V_L(\hat B_1)$ should be used, giving a 95% CI of [1.89, 4.22] for $B_1$.

These calculations can be done in SAS PROC SURVEYREG, using the code on
the website. The following output is obtained:

                               Standard                              95% Confidence
Parameter      Estimate           Error    t Value    Pr > |t|             Interval
Intercept    30.1858583      6.64323949       4.54      <.0001    17.0856787    43.2860379
finger        3.0540995      0.58962334       5.18      <.0001     1.8913879     4.2168111

■
11.2.2.2 Standard Errors Using Jackknife


Suppose we have a stratified multistage sample, with weights $w_i$ and $H$ strata. A total
of $n_h$ psus are sampled in stratum $h$. Recall (see Section 9.3.2) that for jackknife
iteration $j$ in stratum $h$, we omit all observation units in psu $j$ and recalculate the
estimate using the remaining units. Define


$$w_{i(hj)} = \begin{cases} w_i & \text{if observation unit } i \text{ is not in stratum } h \\ 0 & \text{if observation unit } i \text{ is in psu } j \text{ of stratum } h \\ \dfrac{n_h}{n_h - 1} w_i & \text{if observation unit } i \text{ is in stratum } h \text{ but not in psu } j. \end{cases}$$


Then, the jackknife estimator of the with-replacement variance of $\hat B_1$ is


$$\hat V_{JK}(\hat B_1) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{j=1}^{n_h} \left( \hat B_{1(hj)} - \hat B_1 \right)^2, \tag{11.9}$$


where $\hat B_1$ is defined in (11.6) and $\hat B_{1(hj)}$ is of the same form but with $w_{i(hj)}$ substituted
for every occurrence of $w_i$ in (11.6).
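The replicate weights and (11.9) translate almost line for line into code. The following Python sketch is one possible implementation, assuming each stratum contains at least two psus; the function names, the stratum and psu labels, and the data are all hypothetical.

```python
def weighted_slope(x, y, w):
    """Weighted slope, algebraically equal to (11.6); zero weights drop a unit."""
    n_hat = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n_hat
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return s_xy / s_xx


def jackknife_var_slope(x, y, w, stratum, psu):
    """Delete-one-psu jackknife variance of the weighted slope, as in (11.9)."""
    b1 = weighted_slope(x, y, w)
    total = 0.0
    for h in sorted(set(stratum)):
        psus_h = sorted({p for s, p in zip(stratum, psu) if s == h})
        n_h = len(psus_h)                  # must be at least 2
        for j in psus_h:
            w_rep = []
            for wi, s, p in zip(w, stratum, psu):
                if s != h:
                    w_rep.append(wi)                       # other strata unchanged
                elif p == j:
                    w_rep.append(0.0)                      # deleted psu
                else:
                    w_rep.append(wi * n_h / (n_h - 1))     # rest of stratum h
            total += (n_h - 1) / n_h * (weighted_slope(x, y, w_rep) - b1) ** 2
    return total


# Hypothetical single-stratum sample in which every unit is its own psu
x = [11.0, 11.4, 11.8, 12.2, 12.6]
y = [63.0, 65.5, 66.0, 69.0, 70.5]
w = [24.0, 12.0, 2.0, 2.0, 1.0]
v_jk = jackknife_var_slope(x, y, w, stratum=[1] * 5, psu=[0, 1, 2, 3, 4])
print(v_jk)
```

Because the slope is a ratio of weighted sums, multiplying the remaining weights by $n_h/(n_h - 1)$ changes nothing in a one-stratum design, but it matters as soon as there are several strata.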


For our two samples of size 200 from the 3000 criminals,


$$\hat V_{JK}(\hat B_1) = \frac{199}{200} \sum_{j=1}^{200} \left( \hat B_{1(j)} - \hat B_1 \right)^2,$$


where $\hat B_{1(j)}$ is the estimated slope when observation $j$ is deleted and the other observations
reweighted accordingly. The difference between the SRS and the unequal-probability
sample is in the weights. For the SRS, the original weights are $w_i = 3000/200$;
consequently, $w_{i(j)} = 200 w_i / 199 = 3000/199$ for $i \neq j$. Thus for the
SRS, $\hat B_{1(j)}$ is the OLS estimate of the slope when observation $j$ is omitted. For the
SRS, we calculate $\hat V_{JK}(\hat B_1) = 0.050$.

For the unequal-probability sample, the original weights are $w_i = 1/\pi_i$ and $w_{i(j)} =
200 w_i / 199$ for $i \neq j$. The new weights $w_{i(j)}$ are used to calculate $\hat B_{1(j)}$ for each jackknife
iteration, giving $\hat V_{JK}(\hat B_1) = 0.461$. The jackknife estimated variance is larger than the
linearization variance, as often occurs in practice. SAS code for using the jackknife
to estimate the variance is on the website. ■


11.2.3 Multiple Regression


Now let’s give results for multiple regression in general. We rely heavily on matrix 
results found in linear models and regression books listed in the references at the end 
of the chapter. 


EXAMPLE 11.6 
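Suppose we wish to find a relation between $y_i$ and a $p$-dimensional vector of
explanatory variables $\mathbf{x}_i$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]^T$. We wish to estimate the
$p$-dimensional vector of population parameters, $\mathbf{B}$, in the model $y = \mathbf{x}^T \mathbf{B}$. Define


$$\mathbf{y}_U = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \quad \text{and} \quad \mathbf{X}_U = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}.$$


The normal equations for the entire population are

$$\mathbf{X}_U^T \mathbf{X}_U \mathbf{B} = \mathbf{X}_U^T \mathbf{y}_U,$$

and the finite population quantities of interest are, assuming that $(\mathbf{X}_U^T \mathbf{X}_U)^{-1}$ exists,

$$\mathbf{B} = (\mathbf{X}_U^T \mathbf{X}_U)^{-1} \mathbf{X}_U^T \mathbf{y}_U,$$

the least squares estimates for the entire population.

Both $\mathbf{X}_U^T \mathbf{X}_U$ and $\mathbf{X}_U^T \mathbf{y}_U$ are matrices of population totals: $\mathbf{X}_U^T \mathbf{X}_U = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$
and $\mathbf{X}_U^T \mathbf{y}_U = \sum_{i=1}^{N} \mathbf{x}_i y_i$. The $(j, k)$th element of the $p \times p$ matrix $\mathbf{X}_U^T \mathbf{X}_U$ is $\sum_{i=1}^{N} x_{ij} x_{ik}$,
and the $k$th element of the $p$-vector $\mathbf{X}_U^T \mathbf{y}_U$ is $\sum_{i=1}^{N} x_{ik} y_i$.

Thus, we can estimate the matrices $\mathbf{X}_U^T \mathbf{X}_U$ and $\mathbf{X}_U^T \mathbf{y}_U$ using weights. We estimate
$\mathbf{X}_U^T \mathbf{X}_U = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$ by $\sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T$, and we estimate $\mathbf{X}_U^T \mathbf{y}_U = \sum_{i=1}^{N} \mathbf{x}_i y_i$ by
$\sum_{i \in S} w_i \mathbf{x}_i y_i$. Then, analogously to (11.6) and (11.7), define the estimator of $\mathbf{B}$ to be


$$\hat{\mathbf{B}} = \left( \sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \sum_{i \in S} w_i \mathbf{x}_i y_i. \tag{11.10}$$


Let

$$\mathbf{q}_i = \mathbf{x}_i (y_i - \mathbf{x}_i^T \hat{\mathbf{B}}).$$


Then, using linearization (see Exercise 20),


$$\hat V(\hat{\mathbf{B}}) = \left( \sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \hat V\left( \sum_{i \in S} w_i \mathbf{q}_i \right) \left( \sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1}. \tag{11.11}$$


CIs for individual parameters may be constructed as


$$\hat B_k \pm t \sqrt{\hat V(\hat B_k)},$$


where $t$ is the appropriate percentile from the $t$ distribution.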
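The matrix formulas (11.10) and (11.11) can be sketched in plain Python as follows. As before, this is only an illustration under stated assumptions: a single-stage design with one stratum, a with-replacement variance estimator for the total of the $w_i \mathbf{q}_i$, and a small hand-written linear solver in place of a numerical library. The names `solve` and `survey_regression` and the data are hypothetical.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination; a is p x p, b is p x m."""
    p = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for c in range(p):
        pivot_row = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(p):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[p:] for row in aug]


def survey_regression(X, y, w):
    """Return (B_hat, V_hat) from (11.10) and (11.11) for a single-stage sample."""
    n, p = len(y), len(X[0])
    # A = sum of w_i x_i x_i^T and c = sum of w_i x_i y_i
    A = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(p)]
         for j in range(p)]
    c = [[sum(wi * xi[j] * yi for wi, xi, yi in zip(w, X, y))] for j in range(p)]
    B = [row[0] for row in solve(A, c)]
    # z_i = w_i q_i with q_i = x_i (y_i - x_i^T B_hat)
    z = []
    for wi, xi, yi in zip(w, X, y):
        resid = yi - sum(xk * bk for xk, bk in zip(xi, B))
        z.append([wi * xk * resid for xk in xi])
    z_bar = [sum(zi[j] for zi in z) / n for j in range(p)]
    # With-replacement estimate of V(sum of w_i q_i): an assumption of this sketch
    G = [[n / (n - 1) * sum((zi[j] - z_bar[j]) * (zi[k] - z_bar[k]) for zi in z)
          for k in range(p)] for j in range(p)]
    M = solve(A, G)                                      # A^{-1} G
    M_t = [[M[k][j] for k in range(p)] for j in range(p)]
    V = solve(A, M_t)                                    # A^{-1} G A^{-1}; A and G are symmetric
    return B, V


# Hypothetical data: intercept plus one explanatory variable
X = [[1.0, 11.0], [1.0, 11.4], [1.0, 11.8], [1.0, 12.2], [1.0, 12.6]]
y = [63.0, 65.5, 66.0, 69.0, 70.5]
w = [24.0, 12.0, 2.0, 2.0, 1.0]
B_hat, V_hat = survey_regression(X, y, w)
print(B_hat, [V_hat[k][k] ** 0.5 for k in range(2)])   # estimates and standard errors
```

With $p = 2$ and $\mathbf{x}_i = [1, x_i]^T$, the lower-right entry of `V_hat` reproduces the straight-line linearization variance in (11.8) under the same design assumptions.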


Return to the NHANES data that we plotted in Section 7.4.2. Figure 7.21 displayed 
a trend line for body mass index plotted against age (see Exercise 25). We can, 
alternatively, fit a polynomial regression model to the sample. It appears from the 
data plots in Section 7.4.2 that a quadratic model might be reasonable to try. For this 
model, $y_i$ = body mass index (variable bmxbmi) for person $i$ and $\mathbf{x}_i = [1, x_i, x_i^2]^T$,
where $x_i$ = age for person $i$. SAS code to fit this model, on the website, produces the
output below. Other options in SAS software for regression, and additional output, 
are explained in the code. 
Fit Statistics

R-square            0.2906
Root MSE            5.9326
Denominator DF          15


Estimated Regression Coefficients

                               Standard
Parameter      Estimate           Error    t Value    Pr > |t|
Intercept    15.2978480      0.21381337      71.55      <.0001
age           0.5465938      0.01381176      39.57      <.0001
agesq        -0.0051407      0.00015865     -32.40      <.0001


The quadratic term is statistically significant in this model. (In fact, with large data 
sets, it is common to have many of the predictors be statistically significant because 
the sample size is so large.) From the output, the predicted regression model is 


$$\hat{y} = 15.30 + 0.55\,\text{age} - 0.005\,\text{age}^2.$$


This model is not a perfect fit to the data. An examination of the residual plots (see
Exercise 22) shows a pattern in the residuals that indicates another model might
provide a better summary of the data, and other models are explored in the SAS code on
the website. The values $\hat B_0$, $\hat B_1$, and $\hat B_2$ estimate the population quantities $B_0$, $B_1$, and
$B_2$, which are the values that would minimize the sum of squares $\sum_{i=1}^{N} (y_i - B_0 -
B_1\,\text{age}_i - B_2\,\text{age}_i^2)^2$ if the entire population were measured. Thus, the design-based
estimates and standard errors are correct for inference about $B_0$, $B_1$, and $B_2$ even if
the model for the population is not perfect.

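SAS software estimates the value of $R^2$ for the data to be 0.2906. In regression
with data from a random sample, $R^2$ is the percentage of variability in the data that is
explained by the regression model. If we fit a regression model using ordinary least
squares to every person in the population, we would have $R_U^2 = 1 - \text{SSW}/\text{SSTO}$,
where $\text{SSW} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ and $\text{SSTO} = \sum_{i=1}^{N} (y_i - \bar{y}_U)^2$, with $\hat{y}_i$ the value
predicted for unit $i$ by the fitted model. We can estimate SSW and
SSTO using weights as $\widehat{\text{SSW}} = \sum_{i \in S} w_i (y_i - \hat{y}_i)^2$ and $\widehat{\text{SSTO}} = \sum_{i \in S} w_i (y_i - \hat{\bar{y}})^2$,
and estimate $R_U^2$ by $\hat R^2 = 1 - \widehat{\text{SSW}}/\widehat{\text{SSTO}}$. ■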
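The weighted estimate of $R^2$ just described is short enough to write out. In this Python sketch the function name `weighted_r2`, the responses, the predicted values, and the weights are all invented for illustration; the predicted values would normally come from the survey-weighted fit.

```python
def weighted_r2(y, y_hat, w):
    """Estimate the population R^2 as 1 - SSW_hat / SSTO_hat using sampling weights."""
    n_hat = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / n_hat      # weighted mean of y
    ssw = sum(wi * (yi - fi) ** 2 for wi, yi, fi in zip(w, y, y_hat))
    ssto = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    return 1 - ssw / ssto


# Hypothetical responses, predicted values, and weights
y = [22.0, 25.0, 27.0, 30.0]
y_hat = [22.5, 24.5, 27.5, 29.0]
w = [100.0, 250.0, 250.0, 400.0]
print(weighted_r2(y, y_hat, w))
```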


11.2.4 Regression Using Weights versus Weighted Least Squares


Many regression textbooks discuss regression estimation using weighted least squares 
as a remedy for unequal variances. If the model generating the data is 


Y¥=x) B+ 6; 


with e; independent and normally distributed with mean 0 and variance a then ¢;/0; 
follows a normal distribution with mean 0 and variance 1. The weighted least squares 
estimator is 


Bois = (x's xy xt y 
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with £=diag (o7,03,...,07). The weighted least squares estimator minimizes 
> Qi-x} By / a, and gives observations with smaller variance more weight in deter- 
mining the regression equation. If the model holds, then, under weighted least squares 
theory, 


Vur(Bwis) = (XPETEX)T. 


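We are not using weighted least squares in this sense, even though our point
estimator is the same: $\hat{\mathbf{B}}$ is the value that minimizes $\sum_{i \in S} w_i (y_i - \mathbf{x}_i^T \mathbf{B})^2$. Our weights come
from the sampling design, not from an assumed covariance structure. Our estimated
variance of the coefficients is not $(X^T \hat{\Sigma}^{-1} X)^{-1}$, the estimated variance under weighted
least squares theory, but is


$$\left( \sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \hat{V}\left( \sum_{i \in S} w_i \mathbf{x}_i (y_i - \mathbf{x}_i^T \mathbf{B}) \right) \left( \sum_{i \in S} w_i \mathbf{x}_i \mathbf{x}_i^T \right)^{-1}.$$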
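To make the contrast with weighted least squares theory concrete, here is a minimal Python sketch of the sandwich-type variance for a straight-line regression. It rests on assumptions that are not in the text: a single stratum, primary sampling units (PSUs) treated as if sampled with replacement, and the residuals evaluated at the estimated coefficients. All function names and data are hypothetical; survey software handles stratification and other designs that this sketch ignores.

```python
def weighted_fit(x, y, w):
    """Weighted estimates (b0, b1) for y = B0 + B1*x, plus the inverse of
    A = sum_i w_i * x_i x_i^T, where x_i = (1, x_i)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    a_inv = [[swxx / det, -swx / det], [-swx / det, sw / det]]
    return (b0, b1), a_inv


def matmul2(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def design_based_variance(x, y, w, psu):
    """Sandwich estimate of V(B-hat); ASSUMES one stratum and PSUs treated
    as sampled with replacement (an illustrative simplification)."""
    (b0, b1), a_inv = weighted_fit(x, y, w)
    totals = {}  # PSU label -> sum of w_i * q_i, with q_i = x_i * residual_i
    for xi, yi, wi, label in zip(x, y, w, psu):
        resid = yi - b0 - b1 * xi
        z = totals.setdefault(label, [0.0, 0.0])
        z[0] += wi * resid
        z[1] += wi * xi * resid
    n = len(totals)
    zbar = [sum(z[k] for z in totals.values()) / n for k in range(2)]
    # With-replacement variance estimate of the weighted total of q_i
    middle = [[n / (n - 1) * sum((z[r] - zbar[r]) * (z[c] - zbar[c])
                                 for z in totals.values())
               for c in range(2)] for r in range(2)]
    return (b0, b1), matmul2(matmul2(a_inv, middle), a_inv)


# Hypothetical toy data: 3 PSUs with 2 observations each
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
w = [5.0, 5.0, 10.0, 10.0, 20.0, 20.0]
psu = [1, 1, 2, 2, 3, 3]
(b0, b1), v = design_based_variance(x, y, w, psu)
se_b1 = v[1][1] ** 0.5  # design-based standard error of the slope
```

The point estimates agree with what a weighted least squares routine would return for the same weights; only the variance calculation differs, because it uses the variability among PSU totals rather than an assumed covariance structure.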


One may, of course, combine the weighted least squares approach as taught in 
regression courses with the finite population approach by defining the population 
quantities of interest to be 


$$\mathbf{B} = (X_U^T \Sigma_U^{-1} X_U)^{-1} X_U^T \Sigma_U^{-1} \mathbf{y}_U,$$


thus generalizing the regression model. This is essentially what is done in ratio estimation,
using $\Sigma_U = \text{diag}(x_1, x_2, \ldots, x_N)$, as will be shown in Example 11.13.
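As a quick check of the ratio estimation claim, consider the special case (an assumption made here for illustration) of a single explanatory variable with every $x_i > 0$ and no intercept, so that $X_U = (x_1, \ldots, x_N)^T$. The symbols $t_x$ and $t_y$ below denote the population totals of $x$ and $y$.

```latex
\begin{align*}
B &= \left( X_U^T \Sigma_U^{-1} X_U \right)^{-1} X_U^T \Sigma_U^{-1} \mathbf{y}_U \\
  &= \left( \sum_{i=1}^{N} x_i \cdot \frac{1}{x_i} \cdot x_i \right)^{-1}
     \sum_{i=1}^{N} x_i \cdot \frac{1}{x_i} \cdot y_i \\
  &= \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{t_y}{t_x}.
\end{align*}
```

Under these assumptions the generalized population quantity is the population ratio.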


11.2.5 Software for Regression in Complex Surveys


Several software packages among those discussed in Section 9.6 will calculate regres- 
sion coefficients and their standard errors for complex survey data. Before you use 
software written by someone else to perform a regression analysis on sample sur- 
vey data, you should investigate how it deals with missing data. For example, if an 
observation is missing one of the x values, SAS PROC SURVEYREG excludes the 
observation from the analysis. If your survey has a large amount of item nonresponse 
on different variables, it is possible that you may end up performing your regres- 
sion analysis using only twenty of the observations in your sample. You may want to 
consider the amount of item nonresponse as well as scientific issues when choosing 
covariates for your model. 

Some surveys do not release enough information in the public use files to allow you 
to calculate estimated variances for regression coefficients. The public-use data set 
from the Current Population Survey, for example, contains weights for each household 
and person in the sample, but does not provide clustering information. Such surveys, 
however, often provide information on deffs for estimating population totals. In this 
situation, you can estimate the regression parameters using the provided weights. 
Then estimate the variance for the regression coefficients as though an SRS were 
taken, and multiply each estimated variance by an overall deff for population totals. 
In general, deffs for regression coefficients tend to be (but do not have to be) smaller
than deffs for estimating population means and totals, so multiplying estimated vari- 
ances of regression coefficients by the deff often results in a conservative estimate 
of the variance (see Skinner, 1989). Intuitively, this can be explained because a good 
regression model may control for some of the cluster-to-cluster variability in the 
response variable. For example, if part of the reason households in the same cluster 
tend to have more similar crime victimization experiences is the average income level 
of the neighborhood, then we would expect that adjusting for income in the regression 
might account for some of the cluster-to-cluster variability. The residuals from the 
model would then show less effect from the clustering. 
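The deff adjustment described above amounts to a single multiplication. The sketch below assumes you already have an SRS-type variance for a coefficient and an overall deff published for the survey; the function name and the numbers are hypothetical.

```python
import math


def deff_adjusted_se(srs_variance, deff):
    """Inflate an SRS-type variance by a design effect; return (variance, SE)."""
    adjusted = srs_variance * deff
    return adjusted, math.sqrt(adjusted)


# Hypothetical values: SRS-type standard error of 0.10 for a slope,
# and an overall deff of 2.5 reported for population totals
var_adj, se_adj = deff_adjusted_se(0.10 ** 2, 2.5)
```

Because the standard error scales with the square root of the deff, a deff of 2.5 inflates the standard error by a factor of about 1.58, not 2.5.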


11.3 Using Regression to Compare Domain Means


We often want to compare subgroups in a population. In Exercise 21 of Chapter 6,
you showed that the method used to compare domain means in an SRS (namely,
to form a statistic $(\bar{y}_1 - \bar{y}_2)/\sqrt{\hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2)}$ and compare that to a $t$ distribution)
can give incorrect inferences with data from a complex survey. If clusters contain
units from both domains, then $\bar{y}_1$ and $\bar{y}_2$ are correlated so that
$V(\bar{y}_1 - \bar{y}_2) \neq V(\bar{y}_1) + V(\bar{y}_2)$.

But have no fear—we can use regression to compare domain means and to fit 
one-way and factorial analysis of variance (ANOVA) models. To compare the means 
for two domains which together comprise the entire population, define a new variable 
$x$ with $x_i = 1$ if observation $i$ is in domain 1 and $x_i = 0$ if observation $i$ is in domain
2. Then the population slope $B_1$ in a straight-line regression model is $B_1 = \bar{y}_{1U} - \bar{y}_{2U}$
and $\hat{B}_1 = \bar{y}_1 - \bar{y}_2$ (see Exercise 17). Consequently, $\hat{V}(\hat{B}_1) = \hat{V}(\bar{y}_1 - \bar{y}_2)$, and the 95%
CI for $B_1$ is the 95% CI for the difference in domain means, $\bar{y}_{1U} - \bar{y}_{2U}$.
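The identity between the slope and the difference in domain means is easy to verify numerically. The following Python sketch uses hypothetical data and helper names; it checks only the point estimates, since the variance depends on the survey design.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def weighted_line(x, y, w):
    """Weighted estimates (b0, b1) for the straight-line model y = B0 + B1*x."""
    xbar = weighted_mean(x, w)
    ybar = weighted_mean(y, w)
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Hypothetical sample: x = 1 for domain 1, x = 0 for domain 2
y = [22.0, 30.0, 27.0, 25.0, 31.0]
x = [1, 0, 1, 0, 0]
w = [100.0, 250.0, 150.0, 200.0, 50.0]

b0, b1 = weighted_line(x, y, w)
mean1 = weighted_mean([yi for xi, yi in zip(x, y) if xi == 1],
                      [wi for xi, wi in zip(x, w) if xi == 1])
mean2 = weighted_mean([yi for xi, yi in zip(x, y) if xi == 0],
                      [wi for xi, wi in zip(x, w) if xi == 0])
# b1 equals mean1 - mean2, and b0 equals mean2, up to rounding error
```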


EXAMPLE 11.7 Let’s compare the mean value of body mass index for men and women using the data 
in nhanes.dat (see Section 7.4.2 and Example 11.6). Create a variable $x$ with $x_i = 1$
if person $i$ is female and $x_i = 0$ if person $i$ is male. Then fit the model $y = B_0 + B_1 x$.
Partial output, from SAS code given on the website, follows: 


Estimated Regression Coefficients


                           Standard                          95% Confidence
Parameter     Estimate        Error   t Value   Pr > |t|           Interval
Intercept   26.0044123   0.11636359    223.48     <.0001   25.7563891   26.2524354
x            0.3577122   0.20979752      1.71     0.1088   -0.0894606    0.8048851


The slope $\hat{B}_1$ is the difference in domain means $\bar{y}_{\text{female}} - \bar{y}_{\text{male}} = 26.362 - 26.004$.
A 95% CI for $\bar{y}_{U,\text{female}} - \bar{y}_{U,\text{male}}$ is [−0.089, 0.805]. The difference in means is not
significant at the 0.05 level. ■
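The t value and confidence interval in the output can be reproduced from the estimate and its standard error. The sketch below assumes 15 denominator degrees of freedom, the figure reported in the NOTE of the output for Example 11.8, which uses the same data set; the critical value is hard-coded as an approximation of the 0.975 quantile of a t distribution with 15 df.

```python
# Values copied from the output for the slope in Example 11.7
estimate = 0.3577122
std_error = 0.20979752

# ASSUMPTION: 15 degrees of freedom; 2.131450 approximates the
# 0.975 quantile of the t distribution with 15 df.
t_crit = 2.131450

t_value = estimate / std_error
ci_lower = estimate - t_crit * std_error
ci_upper = estimate + t_crit * std_error
```

The interval computed this way agrees with the printed limits to several decimal places, which supports the assumed degrees of freedom.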


Comparing $k$ domain means is similar. You need to define $k - 1$ indicator variables,
with $x_{ij} = 1$ if observation $i$ is in domain $j$ and 0 otherwise, for $j = 1, \ldots, (k - 1)$.
The $k$th domain mean is estimated by $\hat{B}_0$.




FIGURE 11.6 
Boxplot of body mass index for race-ethnicity groups defined by NHANES, incorporating the 
sampling weights.

EXAMPLE 11.8 Let’s compare the mean value of body mass index for the five ethnic groups (variable
ridreth2) defined in the NHANES data. A side-by-side boxplot for these domains
is shown in Figure 11.6. Output, constructed from SAS code given on the website,
follows:

Tests of Model Effects 
Effect Num DF F Value Pr > F 
Model 4 18.02 <.0001 
Intercept 1 25471.5 <.0001 
RIDRETH2 4 18.02 <.0001 


NOTE: The denominator degrees of freedom for the F tests 
is 15. 


The F statistic for the null hypothesis that the mean body mass index of all five groups 
is the same is 18.02, indicating significant differences among the groups. You can also 
do pairwise comparisons of group means, and adjust the p-values for multiple testing 
by using a multiple comparisons method such as Bonferroni. ■
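A Bonferroni adjustment for the pairwise comparisons can be sketched as follows. With five groups there are ten pairs, so each raw p-value is multiplied by 10 and capped at 1. The group labels and raw p-values below are hypothetical placeholders, not results from the NHANES analysis.

```python
from itertools import combinations


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


# Hypothetical labels for the five groups
groups = ["g1", "g2", "g3", "g4", "g5"]
pairs = list(combinations(groups, 2))  # all 10 pairwise comparisons

# Hypothetical raw p-values, one per pair, in the same order as `pairs`
raw_p = [0.001, 0.004, 0.03, 0.2, 0.0005, 0.012, 0.6, 0.049, 0.0001, 0.08]
adjusted_p = bonferroni(raw_p)
significant = [pair for pair, p in zip(pairs, adjusted_p) if p < 0.05]
```

The raw p-values themselves should come from design-based tests, for example from the regression with indicator variables described above.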




11.4 Should Weights Be Used in Regression?


In most areas of statistics, a regression analysis generally has one of three purposes: 


1 It describes the relationship between two or more variables. Of interest may be the
relationship between family income and the infant’s birth weight or the relationship 
between education level, income, and likelihood of being a victim of violent crime. 
The interest is simply in a summary statistic that describes the association between 
the explanatory and response variables. 


2 It predicts the value of y for a future observation. If we know the values for a 
number of demographic and health variables for an expectant mother, can we 
predict the birth weight of the infant, or the probability of the infant’s survival? 


3 It allows us to control future values of y by changing the values of the explanatory 
variables. For this purpose, we would like the regression equation to give us a 
cause-and-effect relationship between x and y. 


Survey data can be used for the first and second purposes, but they generally 
cannot be used to establish definitive causal relationships among variables.¹ Sample
surveys generally provide observational, not experimental, data. We observe a subset 
of possible explanatory variables, and these do not necessarily include the variables 
that are the root causes of changes in y. In a health survey intended to study the 
relationship between nutrition, exercise, and cancer incidence, survey participants 
may be asked about their diet and exercise habits (or the researcher may observe them) 
and be followed up later to see whether they have contracted cancer. Suppose that a 
regression analysis later indicates a significant negative association between vitamin 
E intake and cancer incidence, after adjusting for other variables such as age. The 
analysis only establishes association, not causation; you cannot conclude that cancer 
incidence will decrease if you start feeding people vitamin E. Although vitamin E 
could be the cause of the decreased cancer incidence, the cause could also be one 
of the unmeasured variables that is associated with both vitamin E intake and cancer 
incidence. To conclude that vitamin E affects cancer incidence, you need to perform 
an experiment: randomly assign study participants to vitamin E and no-vitamin-E 
groups, and observe the cancer incidence at a later time. 

The purpose of a regression analysis often differs from that of an analysis to estimate
population means and totals. When estimating the total number of unemployed
persons from a survey, we are interested in the finite population quantity $t_y$; we want
to estimate how many persons in the population in August 2004 were unemployed.
But in a regression analysis, are you interested in $B_0$ and $B_1$, the summary statistics
for the finite population? Or are you interested in uncovering a “universal truth,” so
that you can say, for example, that not only do you find a positive association between
amount of fat in diet and systolic blood pressure for the population studied, but that
you would expect a similar association in other populations? Cochran notes this point
for comparison of domain means: “It is seldom of scientific interest to ask whether
[the finite population domain means are equal], because these means would not be
exactly equal in a finite population, except by rare chance. Instead, we test the null
hypothesis that the two domains were drawn from infinite populations having the same
mean” (1977, p. 39). Comparing domain means is a special case of linear regression,
and Cochran’s comments apply equally well to linear regression in general.


¹ Many statisticians would say that survey data cannot be used to make causal statements in any shape or
form. Experimental units must be randomly assigned to treatments in order to infer causation. Some
surveys, however, such as the study in Example 8.2, include experimentation, and for these we can often
conclude that a change in the treatment caused a change in the response.

Many survey statisticians have debated whether the sampling weights are relevant 
for inference in regression; some of the papers discussing the issue are listed in 
the references at the end of this chapter. These references provide a much deeper 
discussion of the issues involved than we present in this section; we try to summarize 
the various approaches and present the contributions of each to a good analysis of 
survey data. 

Two basic approaches have been advocated: 


1 Design-based. The design-based approach was presented in Section 11.2. The 
quantities of interest are the finite population characteristics B, regardless of how 
well the model fits the population. Inferences are based on repeated sampling from 
the finite population, and the probability structure used for inference is that defined 
by the random variables indicating inclusion in the sample. A model that generates 
the data may exist, but we do not necessarily know what it is, so the analysis does 
not rely on any theoretical model. Weights are needed for estimating population 
means and totals, and by analogy should be used in linear regression as well. 


2 Model-based. A stochastic model describes the relation between $y_i$ and $\mathbf{x}_i$ that holds
for every observation in the population. One possible model is $Y_i \mid \mathbf{x}_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i$,
with the $\varepsilon_i$’s independent and normally distributed with constant variance. If the
observations in the population really follow the model, then the sample design
should have no effect as long as the inclusion probabilities depend on $y$ only
through the $x$’s. The value $\mathbf{B}$ is merely the least squares estimate of $\boldsymbol{\beta}$ if values
for the whole population were known; since only a sample is known, one should
use the ordinary least squares estimators


$$\hat{\boldsymbol{\beta}}_{OLS} = \left( \sum_{i \in S} \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \sum_{i \in S} \mathbf{x}_i y_i.$$

One searches for a model that can be thought to generate the population and then
estimates the parameters for that model.


Särndal et al. (1992) adopt a model-assisted approach; for that approach, a model
is used to specify the parameters of interest, but all inference is based on the survey 
design. Thus you fit a particular model because you believe it a plausible candidate for 
generating the population, but use the sampling weights to estimate the parameters 
and the sample design to estimate variances of the estimate. As inference is made 
using the sample design, we consider the model-assisted approach to be part of the 
design-based approach in this section. 

The distinction among the approaches is important for the survey analyst because
most regression programs use either a design-based or a model-based approach.
SAS PROC REG and the R function lm assume a model-based approach to regression,
as exposited in Section 11.1. Survey software such as SUDAAN, WesVar, and
SAS PROC SURVEYREG estimate the finite population parameters using the
approach in Section 11.2. Thus it is important for you to know which approach you
wish to take. Blindly running your data through software, without understanding what 
the software is estimating, can lead you to misinterpret the results. 

Most statisticians agree that it is a good thing if a regression model describes 
the true state of nature. If it were known that a model would describe every possible 
observation involving x and y, then that model should be adopted. In the physical 
sciences, many models such as force = mass × acceleration can be theoretically
derived. As long as you stay away from near-light velocity, any observation for which
force, mass, and acceleration are accurately measured should be fit by the model.
The design for how observations are sampled, then, should make little difference for
finding the point estimates of regression coefficients, as every possible observation is
described by the model.²

Unfortunately, theoretically derived models known to hold for all observations 
do not often exist for survey situations. An economist may conjecture a relationship 
between number of children, income, and amount spent on food, but there is no 
guarantee that this model will be appropriate for every subgroup in the population. 
Other variables may be related to the amount spent on food (such as educational level 
or amount of time away from home) but not measured in the survey. In addition, the 
true relation among the variables might not be exactly linear. Thus the main challenge 
to model-based inference is specifying the model. 

If taking a model-based approach, then, you need to examine the model assump- 
tions carefully and do everything you can to check the adequacy of the model for your 
data. This includes plotting the data and residuals, performing diagnostic tests, and 
using sampling designs that allow estimation of alternative models that may provide 
a better description of the relationship between variables. (Of course, you should also 
plot the data if adopting a design-based approach.) Inferences about observations not 
in the sample are based solely on the assumption that the model you have adopted 
applies to them, and you need to be very careful about generalizing outside of the 
sampled data. You must assume that the nonsampled population units can also be 
described by the model, and this is a very strong assumption. 

Much is attractive about the model-based approach for regression: It links with 
sociological theories of the investigator, is consistent with other areas of statistics, and 
provides a mechanism for accounting for nonresponse. The model-based approach 
provides a framework for comparing theories about structural relationships. In addi- 
tion, model-based estimates can be used with relatively small samples and with non- 
probability samples. Although design-based inference does not depend on model 
assumptions, it does require large sample sizes in practice to be able to construct CIs. 
The standard errors of the model-based parameter estimates are generally lower than 
those of the corresponding design-based estimates. 

But model misspecification and omitted covariates are of concern for a model-based
analysis, and missing covariates may not show up in standard residual analyses.
Moreover, in a complex survey design, the needed missing predictors may be related
to the design and the survey weights. For example, for our unequal-probability sample
in Figure 11.3, the inclusion probabilities we used depended on the value of $y$. Now,
you can think of height as being determined by many, many variables $x_1, x_2, \ldots$; but
the data set has only one of those possible explanatory variables. If all the other
variables were included in the model, then the unequal selection probabilities would
be irrelevant; because they are not, however, the inclusion probabilities $\pi_i$ have useful
information for estimating the regression slope.


² The sampling design, however, can affect the variances of the point estimates.

Pfeffermann and Holmes (1985) and DuMouchel and Duncan (1983) argue that 
using sampling weights in regression can provide robustness to model misspecifica- 
tion: The weighted estimates are relatively unaffected if some independent variables 
are left out of the model.³ Kott (1991) argues that sampling weights are needed in
linear regression because the choice of covariates in survey data is limited to variables
collected in the survey: If necessary covariates are omitted, $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\beta}}_{OLS}$ are both biased
estimators of $\boldsymbol{\beta}$, but the bias of $\hat{\mathbf{B}}$ is a decreasing function of the sample size, while
$\hat{\boldsymbol{\beta}}_{OLS}$ is only asymptotically unbiased if the probabilities of selection are not related to
the missing covariates. Rubin (1985), Smith (1988), and Little (1991) adopt a model- 
based perspective but argue that sampling weights are useful in model-based inference 
as summaries of covariates describing the mechanism by which units are included 
in the sample. Rubin-Bleuer and Kratina (2005) provide a rigorous mathematical 
framework for inference under both model-based and design-based approaches. 

One point is clear: If the model you are using really does describe the mechanism 
generating the data, then the finite population quantity $\mathbf{B}$ should be close to the
theoretical parameter $\boldsymbol{\beta}$. Thus, if the model is a good one, we would expect that the point
estimate of $\boldsymbol{\beta}$ using the model should be similar to the point estimate $\hat{\mathbf{B}}$ calculated
using sampling weights. We suggest fitting a model both with and without weights. 
If the parameter estimates differ, then you should explore alternatives to the model 
you have adopted. A difference in the weighted and unweighted estimates can tell 
you that the proposed model does not fit well for part of the population. Lohr and Liu 
(1994) explore this issue for the NCVS. 
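The suggestion to fit the model both with and without weights can be sketched in Python. The data below are hypothetical and deliberately extreme: when the straight-line model holds exactly, the weights do not change the fitted slope, but when the true relation is curved and the weights are related to $x$, the two slopes disagree.

```python
def fit_line(x, y, w=None):
    """Least squares estimates (b0, b1) for y = B0 + B1*x.
    If w is None, every observation gets weight 1 (unweighted fit)."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]  # hypothetical weights related to x

# Case 1: the straight-line model holds exactly, so the slopes agree
y_linear = [2.0 + 3.0 * xi for xi in x]
slope_lin_unw = fit_line(x, y_linear)[1]
slope_lin_w = fit_line(x, y_linear, w)[1]

# Case 2: the true relation is quadratic, so the slopes differ
y_curved = [xi ** 2 for xi in x]
slope_cur_unw = fit_line(x, y_curved)[1]
slope_cur_w = fit_line(x, y_curved, w)[1]
```

A noticeable gap between the weighted and unweighted slopes, as in the second case, is a signal to look for a better model rather than a reason to pick one estimate over the other.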


EXAMPLE 11.9 Korn and Graubard (1995) illustrate the difference that including weights can make
in a regression analysis, using data from the live-birth component of the 1988 MIHS. 
As mentioned in Example 11.1, black infants and low-birth-weight infants are over- 
sampled, so their sampling weights are lower than the weights for white, normal- 
birth-weight infants. Figure 11.7 shows a plot of the data and estimated regression 
line when weights are used in calculating the regression parameters; Figure 11.8 
ignores the weights. The weighted regression pulls the regression line to where the 
population is estimated to be; in the unweighted regression, the line provides the best 
least-squares fit to the sample data, but does not describe the population as well. It 
is clear from examining the plots that the regression lines differ to such an extent 
because a straight-line model is not appropriate for the data; if a quadratic regres- 
sion were fit instead, then the models from the weighted and unweighted regressions 
would show greater agreement. In this example, then, the differences between the
parameter estimates with weights and without weights arise because the straight-line
model adopted is inappropriate. ■


³ But this robustness comes at a price; as mentioned earlier, the design-based variance is generally larger
than the model-based variance. Kish (1992) gives a good overview of the variance inflation due to using
weighted estimates rather than estimates without weights.


FIGURE 11.7
Plot of weighted mean gestational age versus weighted mean birthweight for successive groups
of approximately 500 observations. Areas of bubbles are proportional to the estimated population
sizes of the groups. The straight line is the weighted linear regression fit to the original
(ungrouped) data.
Source: From “Examples of Differing Weighted and Unweighted Estimates from a Sample Survey,”
by E.L. Korn and B.I. Graubard, 1995, The American Statistician, 49, pp. 291-295. Copyright © 1995
American Statistical Association. Reprinted by permission.


Each of the approaches to inference about regression parameters in complex sur- 
veys can be appropriate, depending upon the desired use of the regression model. You 
may want to consider the following questions when deciding upon your approach: 


1 Are you performing a regression to generate official statistics that will be used 
to determine public policy? If so, you may want to use the weights to estimate 
parameters and the design to make inferences about the parameters. If you are 
using weights to estimate population and domain means, you may also want to 
use them to estimate regression parameters, so that the results from different 
analyses are consistent (see Alexander, 1991). As noted above, $\mathbf{B}$ should be close
to $\boldsymbol{\beta}$ for a good model and large finite population, so a design-based estimate of
$\mathbf{B}$ should also estimate $\boldsymbol{\beta}$.


2 Was a probability sample taken? If not, then you must use a model-based approach.


3 How large is the sample size? The design-based theory relies on large sample sizes
to make inferences about the parameters. If you have a small sample, you should
probably use a model-based approach.


FIGURE 11.8
Plot of weighted mean gestational age versus weighted mean birthweight for successive groups
of approximately 500 observations. Areas of bubbles are proportional to the sample sizes of
the groups. The straight line is the unweighted linear regression fit to the original (ungrouped)
data.
Source: From “Examples of Differing Weighted and Unweighted Estimates from a Sample Survey,”
by E.L. Korn and B.I. Graubard, 1995, The American Statistician, 49, pp. 291-295. Copyright © 1995
American Statistical Association. Reprinted by permission.


However, a mistake is often made by investigators who have heard the message 
that sampling weights are irrelevant in regression analysis but have ignored the rest 
of the discussion: They ignore the weights and the clustering in the data by simply 
running the survey data through standard regression software. This is incorrect under 
any approach: Whether or not weights are used to construct an estimator, the depen- 
dence in the data reflected in the clustering must be considered when calculating 
standard errors. A model-based approach that incorporates the positive correlation 
among observations in the same cluster is discussed in the next section.


11.5 Mixed Models for Cluster Samples


In Chapter 5 we discussed using a random effects model as a superpopulation model 
for cluster sampling. We can use this approach for regression analyses as well, by 
allowing different clusters to have their own regression equations, but relating the 
different regression equations for the clusters through a model. 


EXAMPLE 11.10 The National Assessment of Educational Progress (NAEP) collects data on student 
background and achievement in the United States. It is sometimes referred to as “The 
Nation’s Report Card,” as it provides a scale for measuring student progress and com- 
paring student achievement among different states and over time. A wealth of infor- 
mation is collected for each student, teacher, and participating school. In addition to 
proficiency scores for various subjects, the student-level data include information on 
the student’s gender, race, ethnicity, courses taken, and variables related to socioeco- 
nomic status. School-level information includes fiscal resources, instructional meth- 
ods, student-body characteristics and expectations of academic achievement. 

The NAEP data can be used to identify school- and student-level variables that 
are associated with mathematics achievement among eighth-grade students. For sim- 
plicity, let’s consider one student-level characteristic, gender; and one school-level 
characteristic, average amount of time spent in class on math tests. In practice, of 
course, you would probably include more variables in the model, as you would expect 
a number of characteristics to be associated with the tested mathematics achievement. 
Let $Y_{ij}$ be the mathematics proficiency score of student $j$ at school $i$ in the sample, and
let $x_{ij} = 1$ if student $j$ at school $i$ is female, and 0 if student $j$ at school $i$ is male.

We expect a clustering effect in these data—measuring all variables that might 
be associated with student achievement scores in mathematics is impossible, and 
the unmeasured characteristics of the schools, teachers, and neighborhoods induce a 
positive correlation in the test scores within a school. For example, the seventh- and 
eighth-grade mathematics teacher in one school might be superb at inspiring students 
to learn mathematics, but that excellence would not be recorded in the survey. The 
students from that class might then all perform better than average on the proficiency 
test, so their scores are more similar, even after adjusting for known covariates, than 
scores of a random sample of students from the population. When unmeasured characteristics
such as these are considered over all schools, the result is a positive intraclass
correlation coefficient. 

Thus a model $Y_{ij} = \beta_0 + x_{ij}\beta_1 + \varepsilon_{ij}$, with $\varepsilon_{ij}$ independent random variables
with mean 0 and variance $\sigma^2$, is likely to be inappropriate for these data. If this
inappropriate model is adopted, the calculated p-values for parameter estimates will 
be far too small. In addition, the model does not allow for different relations between 
gender and test score in different schools—which may occur, as some schools may 
encourage students of one gender more than students of the other gender. 

A model that incorporates cluster effects and allows schools to have different 
slopes for gender is: 


$$Y_{ij} = \beta_{0i} + (x_{ij} - \bar{x}_i)\beta_{1i} + \varepsilon_{ij}.$$

Here, the $\varepsilon_{ij}$’s are assumed to be independent $N(0, \sigma^2)$ random variables; the mean
of $x_{ij}$ for school $i$, $\bar{x}_i$, is subtracted from each $x_{ij}$ so that $\beta_{0i}$ can be interpreted as the
average test score in school $i$. School $i$ has its own straight line regression model with
intercept $\beta_{0i}$ and slope $\beta_{1i}$. But the slopes and intercepts from different schools are
also related through a model. A simple model for the slopes and intercepts allows
them to essentially be randomly distributed about a mean:


$$\beta_{0i} = \beta_0 + \delta_{0i}; \qquad \beta_{1i} = \beta_1 + \delta_{1i},$$


with $\delta_{0i}$ and $\delta_{1i}$ following a bivariate normal distribution with $E_M[\delta_{0i}] = E_M[\delta_{1i}] = 0$,
$V_M[\delta_{0i}] = \tau_{00}$, $V_M[\delta_{1i}] = \tau_{11}$, and $\text{Cov}(\delta_{0i}, \delta_{1i}) = \tau_{01}$. Under this situation, then the
model may be written as


$$Y_{ij} = \beta_0 + (x_{ij} - \bar{x}_i)\beta_1 + \delta_{0i} + (x_{ij} - \bar{x}_i)\delta_{1i} + \varepsilon_{ij}. \qquad (11.12)$$


The parameter $\beta_0$ represents the mean test score for schools; $\beta_1$ represents the mean
slope for gender for schools. The random effects $\delta_{0i}$ and $\delta_{1i}$ represent the difference
in the intercept and slope between school $i$ and the average values for intercept and
slope for all schools; they measure the school effect. Finally, $\varepsilon_{ij}$ refers to additional
deviation from the mean due to the individual student, after the effect of gender and
school have been accounted for.

Note that if $\tau_{00} = \tau_{11} = 0$, there is no school effect on test score, and the model
then reduces to a regular straight line regression model. In most applications, however,
the slopes and intercepts will vary from school to school. ■
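The clustering implied by (11.12) can be made explicit by writing out the variances and covariances of the test scores. The derivation below treats the $x_{ij}$ as fixed and assumes, beyond what is stated in the example, that the random effects $(\delta_{0i}, \delta_{1i})$ are independent of the $\varepsilon_{ij}$ and independent across schools. The shorthand $d_{ij} = x_{ij} - \bar{x}_i$ and the symbol $\tau_{01}$ for $\text{Cov}(\delta_{0i}, \delta_{1i})$ are notation adopted for this sketch.

```latex
\begin{align*}
V_M(Y_{ij}) &= \tau_{00} + 2 d_{ij}\,\tau_{01} + d_{ij}^2\,\tau_{11} + \sigma^2, \\
\text{Cov}_M(Y_{ij}, Y_{ik}) &= \tau_{00} + (d_{ij} + d_{ik})\,\tau_{01} + d_{ij} d_{ik}\,\tau_{11},
  \qquad j \neq k, \\
\text{Cov}_M(Y_{ij}, Y_{i'k}) &= 0, \qquad i \neq i'.
\end{align*}
```

Scores of different students in the same school are therefore correlated through the shared random effects, while scores from different schools are uncorrelated under these assumptions. When $\tau_{00} = \tau_{11} = 0$ (and hence $\tau_{01} = 0$), all covariances vanish, which matches the reduction to an ordinary straight line regression model.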


The model in (11.12) is an example of a mixed linear model; it has both fixed
($\beta_0$ and $\beta_1$) and random ($\delta_{0i}$, $\delta_{1i}$, and $\varepsilon_{ij}$) coefficients. In econometrics, (11.12) is
often referred to as a random-coefficient regression model; in the social sciences, 
it is called a multilevel or hierarchical linear model. Demidenko (2004) and Jiang 
(2007) describe the theory of mixed models. Pfeffermann et al. (1998) and Rabe- 
Hesketh and Skrondal (2006) discuss using mixed models with survey data. These 
models may be fit in SAS PROC MIXED or in specialized software packages. 

The mixed model in (11.12) is a superpopulation model and is assumed to hold 
for all schools and students in the population. One advantage of using such a model 
is that it does not require that the schools be randomly selected, as long as the model 
describes the population. A mixed model approach is also congenial to testing different 
theories about mathematics education. 

The model in (11.12) may also be used as a starting point for further investigation. 
The random effects $\delta_{0i}$ and $\delta_{1i}$ may be estimated for each school; the investigator
may want to examine schools with unusually high or low values to try to conjecture 
why those schools might be different. The investigator may also want to include other 
predictor variables when estimating the intercepts and slopes for the different schools. 
For example, it might be conjectured that having more math tests at a school might 
lead to better mathematics proficiency scores, and might also lead to a smaller gender 
difference in the school. This extra predictor can easily be included in the mixed 
model. Let $z_i$ be the average amount of time spent on math tests at school $i$. Then the
intercept and slope at school $i$ can be modeled as


$$\beta_{0i} = \beta_0 + \gamma_0 z_i + \delta_{0i}; \qquad \beta_{1i} = \beta_1 + \gamma_1 z_i + \delta_{1i};$$

$\gamma_0$ then represents the effect of time spent on math tests on the intercept, and $\delta_{0i}$
represents the remaining school effect after adjusting for $z_i$.
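Substituting these school-level equations into the student-level model $Y_{ij} = \beta_{0i} + (x_{ij} - \bar{x}_i)\beta_{1i} + \varepsilon_{ij}$ gives a single combined equation. This is a routine algebraic step shown here for convenience; the symbols $\gamma_0$ and $\gamma_1$ denote the coefficients of $z_i$ in the intercept and slope equations.

```latex
\begin{align*}
Y_{ij} &= \beta_0 + \gamma_0 z_i + (x_{ij} - \bar{x}_i)\beta_1
          + \gamma_1 z_i (x_{ij} - \bar{x}_i) \\
       &\quad + \delta_{0i} + (x_{ij} - \bar{x}_i)\delta_{1i} + \varepsilon_{ij}.
\end{align*}
```

In this form $\gamma_1$ appears as the coefficient of a cross-level interaction between the school-level variable $z_i$ and the centered student-level variable, which is how the conjectured smaller gender difference in schools with more math tests would show up.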


11.6 Logistic Regression


In linear regression, the response variable is usually considered to be approximately 
continuous—for example, birth weight, income, or leaf area. In surveys, however, 
many variables of interest are dichotomous, with y; taking only values of 1 (yes) or 
0 (no). Logistic regression (see Hosmer and Lemeshow, 2000, for a general reference) 
is often used to predict probabilities of having response 1 for dichotomous variables.

First let’s review logistic regression from a model-based viewpoint. Let $\mathbf{x}$ be a
vector of explanatory variables and $\boldsymbol{\beta}$ be the vector of unknown parameters. Then the
standard logistic regression model takes the form


$$p(\mathbf{x}) = \frac{\exp(\mathbf{x}^T \boldsymbol{\beta})}{1 + \exp(\mathbf{x}^T \boldsymbol{\beta})}, \qquad (11.13)$$


where $p(\mathbf{x})$ represents the probability that a unit with covariates $\mathbf{x}$ will have a response
of 1. Alternatively, the model may be expressed in logit scale, where
$\text{logit}(p) = \ln[p/(1 - p)]$:


$$\text{logit}[p(\mathbf{x})] = \mathbf{x}^T \boldsymbol{\beta}. \qquad (11.14)$$


For the data in Example 10.1, let y; = 1 if household 7 has a computer and y; = 0 if 
household i does not have a computer. Let x; = 1 if household i subscribes to cable 
and x; = 0 if household i does not subscribe to cable. The fitted logistic regression 
model is 


logit [p;] = —0.177 — 0.2813;. 
Note that the slope, —0.28 1, is the log odds ratio from Example 10.1. It is easy to trans- 
form back to predicted conditional probabilities: When x = 1, then In [p/(1 — p)] = 
—0.4573184 so that 


7 exp (—0.4573184) 119 
pl) = 


= 0.388 = —_. 
1 + exp (—0.4573184) 307 


SAS code for calculating the parameter estimates is on the website. m= 
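
The back-transformation in Example 11.11 can be checked numerically. The following Python sketch uses only the rounded coefficients reported in the example; the function name `inv_logit` is our own choice and is not part of the SAS code on the website.

```python
import math

def inv_logit(eta):
    """Convert a log odds value to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Rounded coefficients of the fitted model in Example 11.11
intercept, slope = -0.177, -0.281

p_cable = inv_logit(intercept + slope * 1)     # households with cable (x = 1)
p_no_cable = inv_logit(intercept + slope * 0)  # households without cable (x = 0)
odds_ratio = math.exp(slope)                   # odds ratio implied by the slope

print(round(p_cable, 3), round(p_no_cable, 3), round(odds_ratio, 3))
```

Because the coefficients are rounded to three decimals, the value for x = 1 agrees with 119/307 only to about three decimal places.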


Much of the previous discussion in this chapter on linear regression also applies
to logistic regression: a complex sample design will affect standard errors of the
logistic regression coefficients, just as it affects standard errors of the linear
regression coefficients. Logistic regression with one dichotomous independent variable is
essentially equivalent to finding the odds ratio in a 2 × 2 contingency table, so the
discussion in Chapter 10 about how the sampling design affects standard goodness-of-fit
tests also applies to testing the significance of logistic regression coefficients.

Binder (1983), Chambless and Boyle (1985), and Roberts et al. (1987) give design-based
theory for estimating logistic regression parameters. Just as the design-based
theory for linear regression started with defining the population quantities of interest
using the normal equations, here the quantities of interest are defined in terms of the
likelihood function that would be adopted if the entire population were available for
study. If there are $N$ units in the population, this likelihood (assuming independence) is

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}, \tag{11.15}$$

where $p_i = \exp(\mathbf{x}_i^T \boldsymbol{\beta})/[1 + \exp(\mathbf{x}_i^T \boldsymbol{\beta})]$ represents the probability that a unit with
covariates $\mathbf{x}_i$ has a response of 1. The finite population parameter $\mathbf{B}$ is then defined
to be the maximum likelihood estimate of $\boldsymbol{\beta}$ using (11.15). The parameter $\mathbf{B}$ is the
solution to the system of equations

$$\sum_{i=1}^{N} x_{ij} \left[ y_i - \frac{\exp(\mathbf{x}_i^T \mathbf{B})}{1 + \exp(\mathbf{x}_i^T \mathbf{B})} \right] = 0 \quad \text{for } j = 1, \ldots, p \tag{11.16}$$

if all elements in the population could be observed.

Now that $\mathbf{B}$ is defined, calculate $\hat{\mathbf{B}}$ by substituting estimators for the population
totals in (11.16). A design-based estimator of $\mathbf{B}$ is given by the solution $\hat{\mathbf{B}}$ to

$$\sum_{i \in S} w_i x_{ij} \left[ y_i - \frac{\exp(\mathbf{x}_i^T \hat{\mathbf{B}})}{1 + \exp(\mathbf{x}_i^T \hat{\mathbf{B}})} \right] = 0 \quad \text{for } j = 1, \ldots, p, \tag{11.17}$$

where $S$ denotes the units included in the sample. The $i$th observation in the sample
represents $w_i$ observations in the population.

Variance estimation for logistic regression is discussed in the references cited
above. The coefficients $\hat{\mathbf{B}}$ are defined implicitly in (11.17), so a linearization variance
estimator may be obtained using methods in Binder (1983). Rao et al. (1998) present
a modified version of score tests for testing the significance of logistic regression
coefficients. Any of the resampling methods in Chapter 9 may be used to estimate the
variance of logistic regression coefficients.


EXAMPLE 11.12

Consider using logistic regression to predict the event that body mass index > 25
from the triceps skinfold measurement (variable bmxtri), using the data in nhanes.dat.
Partial output from SAS PROC SURVEYLOGISTIC code on the website is given
below; other output given by SAS is explained in comments in the code.


          Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter   DF   Estimate      Error   Chi-Square   Pr > ChiSq
Intercept    1    -2.6802     0.1237     469.1545       <.0001
BMXTRI       1     0.1496    0.00751     397.3564       <.0001

          Odds Ratio Estimates

             Point         95% Wald
Effect    Estimate    Confidence Limits
BMXTRI       1.161      1.144     1.179

The Wald test used for coefficients compares $\hat{B}^2 / \hat{V}(\hat{B})$ to a chi-square distribution
with 1 df (see Section 10.3.1). ■
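
The estimating equations in (11.17) have no closed-form solution in general, but they can be solved iteratively. The following Python sketch uses Newton-Raphson, which is one common choice and is our assumption here, not necessarily the algorithm used by SAS. The function name `weighted_logistic` and the data are hypothetical: each row stands for one cell of a 2 × 2 table and its weight plays the role of $w_i$. The weights 119 and 188 reproduce the fraction 119/307 from Example 11.11, while 88 and 105 are assumed values for illustration. The sketch gives point estimates only; it does not produce design-based standard errors.

```python
import numpy as np

def weighted_logistic(X, y, w, n_iter=25):
    """Solve the weighted score equations (11.17) by Newton-Raphson."""
    B = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ B))                   # fitted probabilities
        score = X.T @ (w * (y - p))                        # left side of (11.17)
        info = (X * (w * p * (1.0 - p))[:, None]).T @ X    # minus the derivative of the score
        B = B + np.linalg.solve(info, score)               # Newton-Raphson update
    return B

# Hypothetical data: columns of X are an intercept and a 0/1 covariate
X = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([119.0, 188.0, 88.0, 105.0])

B_hat = weighted_logistic(X, y, w)
p_hat = 1.0 / (1.0 + np.exp(-X @ B_hat))
score_at_solution = X.T @ (w * (y - p_hat))   # should be numerically zero
print(B_hat, score_at_solution)
```

With a single 0/1 covariate the solution can be written down directly: the intercept is the weighted log odds in the group with x = 0, and the slope is the weighted log odds ratio.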


Logistic regression has one important difference from linear regression. In Sec- 
tion 11.2, we noted the bias that can occur in estimating linear regression parameters 
if the inclusion probabilities are related to the response variable, but the unequal prob- 
abilities are not accounted for in the analysis. In a health survey, for example, blood 
pressure might be used as a stratification variable, and a higher sampling fraction 
used in the high-blood-pressure stratum than in the low-blood-pressure stratum. If 
we ignore the unequal probabilities and fit a linear regression model predicting the 
continuous variable blood pressure from covariates such as age, diet, and smoking 
history, the regression coefficients may be biased for estimating B. 

Prentice and Pyke (1979), however, show that if a logistic regression model is 
valid and contains an intercept term, then the intercept is the only parameter esti- 
mate affected by a sample design that depends on the y’s. Such sample designs are 
particularly common in epidemiology and economics, where they are referred to as 
case-control studies and choice-based sampling. In an epidemiology application, the 
population may be divided into two strata: persons with lung cancer, and persons 
without lung cancer. A sample is selected from each stratum; as lung cancer is rare, 
the stratified sample has a far greater sampling fraction (and lower sampling weights) 
in the cancer stratum than in the non-cancer stratum. But if the primary interest is in 
estimating the coefficients of age, diet and smoking history in a logistic regression, 
the disproportionate sampling makes no difference in a model-based analysis. We 
would expect that if the model is good, the only difference between a weighted and 
unweighted analysis would appear in the intercept terms. Of course, if a cluster sample 
is used, the dependence of the data induced by clustering will need to be considered 
in the logistic regression model for variance estimation, as discussed by Scott and 
Wild (2003) and Scott (2006). 
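
The algebra behind this statement can be illustrated with a short calculation. The following Python sketch assumes a hypothetical population model with coefficients `b0` and `b1`, and hypothetical sampling fractions `f_case` and `f_control` that depend only on y. It uses Bayes' rule to find the log odds of y = 1 among sampled units; it illustrates why only the intercept shifts and is not a substitute for the estimation theory in Prentice and Pyke (1979).

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical population model: logit p(x) = b0 + b1 * x
b0, b1 = -6.0, 0.8
# Hypothetical sampling fractions: cases (y = 1) are heavily oversampled
f_case, f_control = 0.5, 0.001

def sample_logit(x):
    """Log odds of y = 1 among sampled units with covariate x (Bayes' rule)."""
    p = inv_logit(b0 + b1 * x)
    p_sampled = f_case * p / (f_case * p + f_control * (1.0 - p))
    return logit(p_sampled)

sample_intercept = sample_logit(0.0)
sample_slope = sample_logit(1.0) - sample_logit(0.0)

# The intercept shifts by log(f_case / f_control); the slope is unchanged
print(sample_intercept - b0, math.log(f_case / f_control), sample_slope)
```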


11.7 Generalized Regression Estimation for Population Totals


In Chapter 4 we introduced ratio and regression estimation in the setting of SRSs,
with estimators

$$\hat{t}_{yr} = \frac{\hat{t}_y}{\hat{t}_x}\, t_x$$

and

$$\hat{t}_{yreg} = \hat{t}_y + \hat{B}_1 (t_x - \hat{t}_x).$$

Now let's extend these estimators to complex survey samples. We want to reduce
the mean squared error of the estimator $\hat{t}_y = \sum_{i \in S} w_i y_i$ by including auxiliary
information through the working model

$$Y_i \mid \mathbf{x}_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \tag{11.18}$$

with $\mathbf{x}_i^T = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $V_M(\varepsilon_i) = \sigma_i^2$ for $\sigma_i^2$ known. We assume that the vector
of true population totals $\mathbf{t}_x$ is known and thus can be used to adjust the estimator $\hat{t}_y$. We
allow the variances to differ so that ratio estimation fits into this general framework.
Using a working model in (11.18), but relying on the sampling design for inference, is
an example of the model-assisted approach further described in Sarndal et al. (1992,
Chapters 6 and 7).

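Define

$$\mathbf{B} = (\mathbf{X}_U^T \boldsymbol{\Sigma}_U^{-1} \mathbf{X}_U)^{-1} \mathbf{X}_U^T \boldsymbol{\Sigma}_U^{-1} \mathbf{y}_U,$$

where $\boldsymbol{\Sigma}_U$ is a diagonal matrix with $i$th diagonal element $\sigma_i^2$. The finite population
parameter $\mathbf{B}$ is the weighted least squares estimate of $\boldsymbol{\beta}$ for observations in the
population, using the model in (11.18). Thus the form of $\mathbf{B}$ is inspired by (11.18), but
we then treat $\mathbf{B}$ as a finite population quantity to be estimated using information in
the sample. Note that $\mathbf{X}_U^T \boldsymbol{\Sigma}_U^{-1} \mathbf{X}_U = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T / \sigma_i^2$ and
$\mathbf{X}_U^T \boldsymbol{\Sigma}_U^{-1} \mathbf{y}_U = \sum_{i=1}^{N} \mathbf{x}_i y_i / \sigma_i^2$.
Thus, $\mathbf{B}$ may be estimated by

$$\hat{\mathbf{B}} = \left( \sum_{i \in S} w_i \frac{1}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \sum_{i \in S} w_i \frac{1}{\sigma_i^2} \mathbf{x}_i y_i. \tag{11.19}$$

The generalized regression (GREG) estimator of the population total is

$$\hat{t}_{yGREG} = \hat{t}_y + (\mathbf{t}_x - \hat{\mathbf{t}}_x)^T \hat{\mathbf{B}}, \tag{11.20}$$

where $\hat{\mathbf{B}}$ is given in (11.19). The term $(\mathbf{t}_x - \hat{\mathbf{t}}_x)^T \hat{\mathbf{B}}$ in (11.20) is a regression adjustment
to the Horvitz-Thompson estimator, $\hat{t}_y = \sum_{i \in S} w_i y_i$. Note that $\hat{t}_{yGREG}$ is a weighted
sum of the $y_i$ values in the sample: we can write

$$\hat{t}_{yGREG} = \sum_{i \in S} w_i g_i y_i, \tag{11.21}$$

where

$$g_i = 1 + (\mathbf{t}_x - \hat{\mathbf{t}}_x)^T \left( \sum_{j \in S} w_j \frac{1}{\sigma_j^2} \mathbf{x}_j \mathbf{x}_j^T \right)^{-1} \frac{1}{\sigma_i^2} \mathbf{x}_i. \tag{11.22}$$

The values $g_i$ are the adjustments to the weights made by using the regression
estimator. For large samples, we expect $\hat{\mathbf{t}}_x$ to be close to $\mathbf{t}_x$ so that $g_i$ will be close to 1
for many observations.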

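For any choice of the constants $\sigma_i^2$, the GREG estimator calibrates the sample to
the population total of each x variable used in the regression. To see this, look at the
GREG estimator of $\mathbf{t}_x$: from (11.21),

$$
\begin{aligned}
\hat{\mathbf{t}}_{xGREG} &= \sum_{i \in S} w_i g_i \mathbf{x}_i \\
&= \hat{\mathbf{t}}_x + \sum_{i \in S} w_i \left[ (\mathbf{t}_x - \hat{\mathbf{t}}_x)^T \left( \sum_{j \in S} w_j \frac{1}{\sigma_j^2} \mathbf{x}_j \mathbf{x}_j^T \right)^{-1} \frac{1}{\sigma_i^2} \mathbf{x}_i \right] \mathbf{x}_i \\
&= \hat{\mathbf{t}}_x + \sum_{i \in S} w_i \frac{1}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \left( \sum_{j \in S} w_j \frac{1}{\sigma_j^2} \mathbf{x}_j \mathbf{x}_j^T \right)^{-1} (\mathbf{t}_x - \hat{\mathbf{t}}_x) \\
&= \hat{\mathbf{t}}_x + (\mathbf{t}_x - \hat{\mathbf{t}}_x) = \mathbf{t}_x.
\end{aligned}
$$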
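
The calibration property can be verified numerically. The following Python sketch computes the weight adjustments $g_i$ of (11.22) with NumPy and checks that the adjusted weights reproduce the known x totals. The function name `greg_adjustments`, the five-unit sample, and the population totals are all hypothetical values chosen for illustration.

```python
import numpy as np

def greg_adjustments(X, w, t_x, sigma2):
    """Weight adjustments g_i of (11.22) for the sampled units."""
    t_x_hat = X.T @ w                            # estimated totals of the x variables
    M = (X * (w / sigma2)[:, None]).T @ X        # sum of w_i x_i x_i^T / sigma_i^2
    lam = np.linalg.solve(M, t_x - t_x_hat)      # M^{-1} (t_x - t_x_hat)
    return 1.0 + (X @ lam) / sigma2

# Hypothetical sample of 5 units: columns are an intercept and one auxiliary variable
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 5.0], [1.0, 7.0], [1.0, 9.0]])
y = np.array([3.0, 6.0, 8.0, 10.0, 15.0])
w = np.full(5, 20.0)                  # sampling weights
sigma2 = np.ones(5)                   # working-model variances
t_x = np.array([100.0, 560.0])        # assumed known totals: N and the total of x

g = greg_adjustments(X, w, t_x, sigma2)
t_x_greg = X.T @ (w * g)              # GREG estimate of t_x; should equal t_x
t_y_greg = np.sum(w * g * y)          # GREG estimate of t_y, as in (11.21)
print(g, t_x_greg, t_y_greg)
```

Because the $g_i$ depend only on the x values and the weights, the same adjusted weights $w_i g_i$ can be reused for any response variable.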


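Using linearization,

$$V(\hat{t}_{yGREG}) = V[\hat{t}_y + (\mathbf{t}_x - \hat{\mathbf{t}}_x)^T \hat{\mathbf{B}}] \approx V(\hat{t}_y - \hat{\mathbf{t}}_x^T \mathbf{B}).$$

Let $e_i = y_i - \mathbf{x}_i^T \hat{\mathbf{B}}$ be the $i$th residual. Then the variance may be estimated by

$$\hat{V}_1(\hat{t}_{yGREG}) = \hat{V}\left( \sum_{i \in S} w_i e_i \right).$$

An alternative estimator of the variance (see Exercise 21) is

$$\hat{V}_2(\hat{t}_{yGREG}) = \hat{V}\left( \sum_{i \in S} w_i g_i e_i \right).$$

If the model is a good one, we expect the variability in the residuals to be smaller
than the variability in the original observations, so that the GREG estimator will be
more efficient than $\hat{t}_y$. In an SRS, for example,

$$\hat{V}(\hat{t}_y) = N^2 \left( 1 - \frac{n}{N} \right) \frac{1}{n} \, \frac{\sum_{i \in S} (y_i - \bar{y})^2}{n - 1}$$

but

$$\hat{V}_1(\hat{t}_{yGREG}) = N^2 \left( 1 - \frac{n}{N} \right) \frac{1}{n} \, \frac{\sum_{i \in S} e_i^2}{n - 1};$$

if the residuals tend to be smaller than the deviations of $y_i$ about the mean, then the
estimated variance is smaller for the GREG estimator.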


EXAMPLE 11.13 Ratio estimation. For ratio estimation, we adopt the working model

$$y_i = \beta x_i + \varepsilon_i, \qquad V_M(\varepsilon_i) = \sigma^2 x_i.$$

The population quantity $B$ is the weighted least squares estimate of $\beta$ using the whole
population. Then, using (11.19),

$$\hat{B} = \left( \sum_{i \in S} w_i \frac{x_i^2}{\sigma^2 x_i} \right)^{-1} \sum_{i \in S} w_i \frac{x_i y_i}{\sigma^2 x_i} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i x_i} = \frac{\hat{t}_y}{\hat{t}_x}.$$

The generalized regression estimator of the population total is

$$\hat{t}_{yGREG} = \hat{t}_y + (t_x - \hat{t}_x) \frac{\hat{t}_y}{\hat{t}_x} = \frac{\hat{t}_y}{\hat{t}_x}\, t_x,$$

which is the usual ratio estimator. ■


EXAMPLE 11.14 Poststratification. We discussed poststratification in Sections 4.4 and 8.5.2 as a
method of calibrating estimates to population totals of subgroups and as a method
of adjusting for nonresponse. Suppose we know the population counts $N_c$ for $C$
poststrata, $c = 1, \ldots, C$. Define the variables $x_{ic} = 1$ if observation unit $i$ is in
poststratum $c$ and 0 otherwise, and let $\mathbf{x}_i = [x_{i1}, \ldots, x_{iC}]^T$. Consider the working
model

$$Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_C x_{iC} + \varepsilon_i$$

with $V_M(\varepsilon_i) = \sigma^2$. Then,

$$\sigma^2 \mathbf{X}_U^T \boldsymbol{\Sigma}_U^{-1} \mathbf{X}_U = \mathbf{X}_U^T \mathbf{X}_U = \text{diag}(N_1, \ldots, N_C)$$

and

$$\sigma^2 \sum_{i \in S} w_i \frac{1}{\sigma^2} \mathbf{x}_i \mathbf{x}_i^T = \text{diag}(\hat{N}_1, \ldots, \hat{N}_C).$$

As a result, $\hat{B}_c = \hat{t}_{yc} / \hat{N}_c$, where $\hat{t}_{yc} = \sum_{i \in S} w_i x_{ic} y_i$ is the estimated population total in
poststratum $c$ and $\hat{N}_c = \sum_{i \in S} w_i x_{ic}$ is the estimated population count in poststratum
$c$. The generalized regression estimator is

$$\hat{t}_{yGREG} = \hat{t}_y + \sum_{c=1}^{C} (N_c - \hat{N}_c) \frac{\hat{t}_{yc}}{\hat{N}_c} = \sum_{c=1}^{C} \frac{N_c}{\hat{N}_c}\, \hat{t}_{yc}.$$


Often, the auxiliary variables are useful for many of the response variables of
interest. You may want to poststratify by age, race, and gender groups when
estimating every population total for your survey. This is easily implemented because
the generalized regression estimator is a linear estimator in y, as seen in (11.21):
$\hat{t}_{yGREG} = \sum_{i \in S} w_i g_i y_i$. The weight adjustments $g_i$ in (11.22) depend on the x's but
they do not depend on values of the response variable. To estimate totals with the
generalized regression estimator, form a new column in the data with values $a_i = w_i g_i$.
Then use the vector of $a_i$ as the weight vector for estimating the population total of
any variable.
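
For poststratification, the weight adjustment in (11.22) reduces to $g_i = N_c / \hat{N}_c$ for a unit in poststratum $c$, which follows from the diagonal matrices in Example 11.14. The following Python sketch builds the new weight column $a_i = w_i g_i$ under that simplification. The poststratum labels, weights, responses, and population counts are hypothetical.

```python
import numpy as np

# Hypothetical sample: poststratum label, sampling weight, and response for each unit
stratum = np.array([0, 0, 1, 1, 1, 2])
w = np.array([10.0, 10.0, 10.0, 10.0, 10.0, 10.0])
y = np.array([4.0, 6.0, 1.0, 2.0, 3.0, 8.0])
N = np.array([30.0, 25.0, 15.0])      # assumed known poststratum counts N_c

N_hat = np.bincount(stratum, weights=w, minlength=len(N))   # estimated counts
g = N[stratum] / N_hat[stratum]       # weight adjustment g_i = N_c / N_hat_c
a = w * g                             # new weight column a_i = w_i g_i

t_y_post = np.sum(a * y)              # poststratified estimate of the total of y
calibrated_counts = np.bincount(stratum, weights=a, minlength=len(N))
print(g, t_y_post, calibrated_counts)
```

The column `a` can then be used in place of `w` to estimate the total of any other response variable measured on the same units.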
Chapter Summary


In regression methods with complex survey data, the population characteristics of
interest are $\mathbf{B}$, the least squares or logistic regression coefficients that would be
estimated if we knew the entire population. Since $\mathbf{B}$ is a function of population totals, it
is estimated by $\hat{\mathbf{B}}$ using the sampling weights. Ideally, the finite population values $\mathbf{B}$
reflect an underlying relationship between y and x, but inferences about $\mathbf{B}$, using the
survey design, are valid whether the regression model is a good one or not.

The generalized regression estimator provides a method for using auxiliary infor- 
mation to reduce the mean squared error of estimators. It can also be used to reduce 
bias due to nonresponse. 


Key Terms 


Generalized regression estimator (GREG): An estimator of a population total that 
uses auxiliary information through a regression model. 

Model-assisted estimation: An approach to inference in which a population model 
motivates the form of estimators, but all inference is based on the survey design. 


For Further Reading 


Kutner et al. (2005) is a general reference on linear regression analysis for data 
assumed to be generated from a model (not survey data). Graybill (1976) and 
Ravishanker and Dey (2002) present theoretical results about regression models, 
again in the non-survey setting. 

If you want to learn more about inference in sample surveys, start with the paper by 
Brewer and Mellor (1973), who present an insightful and entertaining debate between 
“Harry,” a design-based survey statistician, and “Fred,” who is fresh from graduate 
school and promotes a model-based approach. The book by Brewer (2002) also con- 
trasts the approaches. Smith (1994) provides an interesting review of philosophies of 
inference, by a statistician whose previous work adhered to a model-based approach. 
Binder and Roberts (2003) discuss inference for regression models. Robinson (1987) 
studies another approach to inference in survey sampling, conditional design-based 
inference; references to earlier work are given in the paper. The model-assisted 
approach to inference that we use in this chapter is discussed in more detail by 
Sarndal et al. (1992). 

The theory for regression estimation in complex surveys has been developed by 
many people. Kish and Frankel (1974) is one of the first papers to show that the sample 
design affects estimates of regression parameters. Other references for further reading 
include Konijn (1962), Kalton (1983), Valliant et al. (2000), Fuller (1975, 1984, 2002), 
and Korn and Graubard (1999). Lehtonen and Pahkinen (2004) discuss linear and 
logistic regression in surveys, and present a case study of multi-level modeling in an 
educational survey. 

Sarndal (2007) gives a clear overview of generalized regression estimators and 
calibration in survey sampling. Estevao and Sarndal (2006) discuss a functional form
of calibration. Beaumont and Alavi (2004) derive a robust generalized regression esti-
mator, and Breidt and Opsomer (2000) and Montanari and Ranalli (2005) use nonpara- 
metric regression methods in survey sampling. Gelman (2007) adopts a hierarchical 
Bayesian approach to weight adjustment and Valliant (2009) discusses a model-based 
approach to using auxiliary information when estimating population totals. Montanari 
(1987) and Rao (1994) present an alternative method of using regression for estimat- 
ing population totals. Silva and Skinner (1997) discuss methods for selecting the x 
variables in regression estimation. 


A. Introductory Exercises 


Read one of the articles listed in the file chapter11papers.html on the book website,
or another article in which regression or logistic regression is used on data from a 
complex survey. Write a critique of the article. What is the purpose and design of the 
survey? What is the goal of the analysis? How do the authors use information from 
the survey design in the analysis? Do you think that the data analysis is done well? If 
so, why? If not, how could it have been improved? Are the conclusions drawn in the 
article justified? 


An investigator wants to study the relationship between a child’s age and number of 
siblings, and the dollar amount of the child’s Christmas list presented to Santa Claus. 
She also wants to estimate the total number of children that visit Santa Claus, and the 
total dollar amount of all children's requests. It would be very difficult to construct
a sampling frame of children who will visit Santa Claus between December 1 and
December 24, but the investigator has a list of shopping malls and stores in which 
Santa will appear in the city, as well as the times that Santa will be at each location. 
The Santa sites are divided into four categories: 23 department stores, 19 discount 
stores, 15 toy stores, and 5 shopping malls. The investigator wants you to help design 
the sample of children. 


a What questions would you ask the investigator to clarify the problem? 


b Assuming any answers you like to the questions you asked, suggest a design for 
the survey. 


c How will your survey design affect the regression analysis of the data? How do
you propose to analyze the data? Are there other explanatory variables that you 
would suggest to the investigator? 


Use the data in file spanish.dat (see Exercise 5 of Chapter 5). Let domain 1 consist of
students who are planning a trip to a Spanish-speaking country in the next year and
domain 2 consist of the students who are not planning such a trip. We are interested
in whether the mean vocabulary score (y) differs in the two domains. The population
domain mean in domain 1 is $\bar{y}_{U1}$ and the population domain mean in domain 2 is $\bar{y}_{U2}$.
Using regression, estimate $\bar{y}_{U1} - \bar{y}_{U2}$ and give a 95% CI. Is there evidence that the
domain means differ?


B. Working with Survey Data


Use the data in anthrop.dat for this problem. 


a Construct a population from the 3000 observations in anthrop.dat in which the 
1000 individuals with the highest value of y have been removed. Now take an 
SRS of size 200 from the remaining 2000 individuals, and plot the data along 
with the ordinary least squares regression line. How does this line compare to the 
population regression line? 


b Repeat (a), but use as the population the 2000 individuals with the lowest value 
of x. 


c Is there a difference in the regression equations in (a) and (b)? Explain, and relate 
your findings to the model in (11.1). 


Use the data in nybight.dat (see Exercise 18 of Chapter 3) for this problem. Using 
the 1974 data, estimate the coefficients in a straight line regression model predicting 
weight of the catch from the number of fish caught. Give standard errors for your 
estimates. Be sure to plot the data! 


Perform a model-based analysis for the setting in Exercise 5. Be sure to examine the 
residuals and postulate an appropriate variance structure for the model. 


Repeat Exercise 5 for predicting number of species caught from the surface 
temperature. 


Repeat Exercise 6 for predicting number of species caught from the surface 
temperature. 


Use the data in teachers.dat (described in Exercise 15 of Chapter 5) for this problem. 


a Estimate the coefficients in a straight line regression model predicting preprmin
from size. Give standard errors for your estimates. Is there evidence that the two 
variables are related? (Be sure to plot the data!) 


b Perform a model-based analysis of the same data. Be sure to examine the residuals 
and postulate an appropriate variance structure for the model. 


Use the data in books.dat (described in Exercise 8 of Chapter 5) for this problem. 
a Plot replace vs. purchase for the raw data. 
b Plot replace vs. purchase using the sampling weights.


c Using a design-based approach, estimate the regression equation for predicting
replace from purchase, along with its standard error. How many df would you use 
in constructing a CI for the slope? 


For the situation in Exercise 10, postulate a model for the variance structure. Using 
your model, estimate the slope of the regression line predicting replace from pur- 
chase. How do your estimate and its standard error compare with your answers in 
Exercise 10? 


Use your data set from Exercise 13 of Chapter 3 for this problem. Using the weights, 
fit a regression model predicting acres92 from largef92. Give a standard error for the 
estimated slope. Now ignore the sampling design, and calculate the ordinary least
squares estimate of the slope. Do your point estimates differ? Explain why or why
not by examining plots of the data. 


Lush (1945, p. 95) discussed different estimates of heritability for milk fat percentage 
in dairy cattle herds. Heritability is defined to be the percentage of variability in fat 
percentage that is attributable to differences in the heredity of different individuals; the 
remainder of the variability is attributed to differences in environment. He noted that 
when the herd was treated as an SRS, the estimate of heritability was about 0.8; when 
fat percentage for daughters was regressed on fat percentage for dams, and where 
each dam was represented by only one record, the estimate of heritability decreased 
to below 0.3. 

From a sampling perspective, why are these estimates so different? Discuss how 
you would analyze the full herd data from both a design-based and a model-based 
perspective. 


Using the data in nhanes.dat, fit a straight line regression model predicting y = triceps 
skinfold (variable bmxtri) from x = body mass index (variable bmxbmi). You plotted 
these data in Exercise 15 of Chapter 7. Give a 95% CI for the slope, and calculate $R^2$
for these data. Draw your regression line on the plot. 


Using the data in nhanes.dat, fit a straight line regression model predicting y = waist 
circumference (variable bmxwaist) from x = thigh circumference (variable bmxthicr). 
You plotted these data in Exercise 16 of Chapter 7. Give a 95% CI for the slope, and 
calculate $R^2$ for these data. Draw your regression line on the plot.


Using the data in ncvs2000.dat, fit a logistic regression model predicting whether a 
person is a victim of violent crime from age and sex. Is a quadratic term needed for age? 


C. Working with Theory 


Comparison of domain means. Suppose the population may be divided into two
groups, with respective sizes $N_1$ and $N_2$ and population means $\bar{y}_{1U}$ and $\bar{y}_{2U}$. The
overall population mean is $\bar{y}_U = (N_1 \bar{y}_{1U} + N_2 \bar{y}_{2U})/N$, with $N = N_1 + N_2$. Let $x_i = 1$
if observation unit $i$ is in group 1, and $x_i = 0$ if it is in group 2. The weight for
observation unit $i$ is $w_i$.

Show that $B_1 = \bar{y}_{1U} - \bar{y}_{2U}$ and $B_0 = \bar{y}_{2U}$. Also show that

$$\hat{B}_1 = \frac{\sum_{i \in S} w_i x_i y_i}{\sum_{i \in S} w_i x_i} - \frac{\sum_{i \in S} w_i (1 - x_i) y_i}{\sum_{i \in S} w_i (1 - x_i)} = \bar{y}_1 - \bar{y}_2$$

and $\hat{B}_0 = \bar{y}_2$.

Consider the SRS data in file uneqvar.dat. 

a Plot y vs. x.

b Find the fitted regression line under the assumption of equal variances.


c Calculate $\hat{V}_M(\hat{B}_1)$ and $\hat{V}_L(\hat{B}_1)$. How do they compare?


Show that (11.10) is equivalent to (11.6) and (11.7) for straight-line regression.


(Requires linear algebra and calculus.) The linearization estimator of $V(\hat{\mathbf{B}})$ can be
found by the method outlined in Section 9.1 (see Shah et al., 1977). However, the
calculations are easier using the Demnati-Rao (2004) method discussed in Exercise 23
of Chapter 9. Show (11.11) using the Demnati-Rao method. HINT: Use the fact (see
Harville, 1997, p. 307) that if $\mathbf{F}$ is a nonsingular matrix whose entries are functions
of $u$, then

$$\frac{\partial \mathbf{F}^{-1}}{\partial u} = -\mathbf{F}^{-1} \frac{\partial \mathbf{F}}{\partial u} \mathbf{F}^{-1}.$$


(Requires linear algebra and calculus.) Use the Demnati-Rao (2004) method
discussed in Exercise 23 of Chapter 9 to estimate $V(\hat{t}_{yGREG})$ for $\hat{t}_{yGREG}$ defined in (11.20).
Hint: Use the matrix differentiation result given in Exercise 20.


Plotting residuals in regression models with complex survey data. In a design-based 
framework for inference, the regression coefficients $\hat{\mathbf{B}}$ estimate the population values
$\mathbf{B}$. Inferences such as CIs depend on the inclusion probabilities in the sampling design
and thus do not depend on model assumptions. Design-based inferences about the 
finite-population regression parameters of a bad model are valid. Nevertheless, as dis- 
cussed in Section 11.4, we often are interested in an underlying theoretical model and 
want to assess how well the population regression model fits. We can plot residuals 
versus predicted values, incorporating the weights, using methods in Section 7.4.2. 
Plot the residuals versus predicted values for the regression model in Example 11.6. 
What do you see in your plot? How would you change the model? 


Regression diagnostics for complex survey data. Jenney (2005) and Li and Valliant
(2006) independently developed regression diagnostics methods for complex survey
data. The leverages for the population values are the diagonal elements of the matrix

$$\mathbf{H} = \mathbf{X}_U (\mathbf{X}_U^T \mathbf{X}_U)^{-1} \mathbf{X}_U^T,$$

so that the leverage of unit $i$ in the population is

$$h(U)_i = \mathbf{x}_i^T (\mathbf{X}_U^T \mathbf{X}_U)^{-1} \mathbf{x}_i.$$

The leverage of an observation is a measure of the distance from $\mathbf{x}_i$ to the means of
the set of explanatory variables (Kutner et al., 2005). Using the weights, define the
leverage of unit $i$ in the sample as

$$h(S)_i = w_i \mathbf{x}_i^T \left( \sum_{j \in S} w_j \mathbf{x}_j \mathbf{x}_j^T \right)^{-1} \mathbf{x}_i.$$

a Show that $\sum_{i \in S} h(S)_i = p$, the number of parameters in the regression model.

b Calculate the leverage, using the weights, for each observation in file anthuneq.dat,
used in Section 11.2. Which points have the highest values of leverage? Does your
assessment of the high-leverage points change if you do not use the weights?


The diagnostic statistic DFFITS (Belsley et al., 1980) is often used to assess
influential observations in a data set. A complex survey version of DFFITS can be calculated
using the survey-weighted leverage $h(S)_i$ from Exercise 23 along with the residual
for observation $i$, $e_i = y_i - \hat{y}_i$:

$$\text{DFFITS}_i = \frac{h(S)_i \, e_i}{[1 - h(S)_i] \, \text{SE}(\hat{y}_i)},$$

where

$$\hat{V}(\hat{y}_i) = \hat{V}(\mathbf{x}_i^T \hat{\mathbf{B}}) = \mathbf{x}_i^T \hat{V}(\hat{\mathbf{B}}) \mathbf{x}_i.$$

Calculate DFFITS for each observation in file anthuneq.dat.


(Requires theory of linear models.) Local polynomial regression with survey data.
Section 7.4.2 discussed using smoothed trend lines in bivariate plots of survey data (see
Korn and Graubard, 1998 and Bellhouse and Stafford, 2001). We posit a model
$y_i = g(x_i) + \varepsilon_i$, where the second derivative of $g$ is continuous, and estimate the underlying
smooth function $g(x)$ by sliding a kernel window along the data and fitting a straight
line (or higher-order polynomial) to the weighted data in that window. As with density
estimation, briefly discussed in Section 7.4.1, the kernel function $K$ is a symmetric
density function such as the normal kernel function $K_N(t) = \exp(-t^2/2)/\sqrt{2\pi}$ or the
quadratic kernel function $K_Q(t) = \frac{3}{4}(1 - t^2)$ for $|t| < 1$. Since the data are from a
complex survey, the weights used in fitting the local regression include the survey weights
as well as the kernel weights. Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ denote the observations in
the sample. For local linear regression, the function $g$ at a point $t$ is estimated by:

$$\hat{g}(t) = [1 \;\; 0] (\mathbf{X}_t^T \mathbf{W}_t \mathbf{X}_t)^{-1} \mathbf{X}_t^T \mathbf{W}_t \mathbf{y},$$

where

$$\mathbf{X}_t = \begin{bmatrix} 1 & x_1 - t \\ \vdots & \vdots \\ 1 & x_n - t \end{bmatrix}$$

and

$$\mathbf{W}_t = \text{diag}\left[ w_1 K\left( \frac{x_1 - t}{h} \right), \ldots, w_n K\left( \frac{x_n - t}{h} \right) \right].$$

Calculate a local linear regression function for the NHANES data, using y = triceps
skinfold and x = body mass index.


(Requires theory of linear models.) Suppose the “true” model describing the relation
between x and y is

$$Y_i \mid x_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$

where the $\varepsilon_i$ are independently generated from a $N(0, \sigma_i^2)$ distribution. Let $\boldsymbol{\Sigma}$ be a
matrix with diagonal entries $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$. What is the covariance matrix for the
ordinary least squares parameter estimators? How does this relate to the discussion
of different estimators of the variance in Section 11.2.2?


TABLE 11.1
(a) Population Counts and (b) Sample Sizes for Exercise 30.

(a) Population Counts          (b) Sample Sizes
        Age Group                      Age Group
       ≤25     >25                    ≤25     >25
F       50     100             F       15       5
M      250     900             M       26      14

The coefficient of determination $R^2$ is often reported for regression analyses. For a
straight line regression, the finite population quantity $R^2$ is defined to be

$$R^2 = \frac{B_1 \sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)}{\sum_{i=1}^{N} (y_i - \bar{y}_U)^2}.$$

a Show that $R^2$ is the square of the population correlation coefficient $R$ defined in
(4.1).


b Write $R^2$ as a function of population totals.


c Give an estimator $\hat{R}^2$ of $R^2$ for data from a complex sample, using weights.


Fienberg (1980) says, “we know of no justification whatsoever for applying standard 
multivariate methods to weighted data . . . the automatic insertion of a matrix of 
sample-based weights into a weighted least-squares analysis is more often than not 
misleading, and possibly even incorrect.” Which approach to regression inference 
does Fienberg advocate? What is your reaction? 


Assuming a model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

with $V(\varepsilon_i) = \sigma^2$, what is the generalized regression estimator of $t_y$?


Suppose the population counts for a cross-classification by age and gender are given
in Table 11.1(a). Table 11.1(b) gives the sample sizes in each group for a sample from
the population. The sampling weight for each observation is $w_i = 20$.


a Find the poststratification weight adjustments $g_i$ for observations in each of the
four cells. Show that poststratification adjustments must always be positive, but
they can be less than one.


b Now find weight adjustments $g_i$ based on the model

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i, \qquad V_M(\varepsilon_i) = \sigma^2,$$

where $x_i = 1$ if observation $i$ is female and 0 otherwise, and $z_i = 1$ if observation
$i$ is 25 or younger and 0 otherwise. Do the weight adjustments have to be positive
in this model?


D. Projects and Activities


Trucks. Use the data in Exercise 34 of Chapter 3 and Exercise 27 of Chapter 7 for 

this exercise. 

a Fit a straight line model predicting y = miles_annl from x = model year
(adm_modelyear). Give a 95% CI for the slope.

b How well does this model fit the data?


c What other variables in the data set might be useful for predicting y? Fit a multiple
regression model predicting y using x variables of your choice. 


Baseball data. Use the data from Exercise 29 in Chapter 7. What variables in the 
data do you think might be useful for predicting log(salary)? Fit a multiple regression 
model predicting log(salary) from these variables. 


IPUMS exercises. 


a Regress inctot on covariates of your choice using your sample from Exercise 38
of Chapter 5. Write a paragraph interpreting the results of your analysis.


b Perform a logistic regression predicting whether a person is in the labor force
(variable labforce) from covariates of your choice. 


Activity for course project. Using the survey you chose in Exercise 31 of Chapter 7, 
use regression methods to predict a response of interest from covariates in the data. 
If the survey has no continuous responses, use logistic regression to predict a binary 
response. Make sure you plot the data appropriately. 


Two-Phase Sampling 


Nearly the whole of the states have now returned their census. I send you the result, which as far as
founded on actual returns is written in black ink, and the numbers not actually returned, yet pretty well 
known, are written in red ink. Making a very small allowance for omissions, we are upwards of four 
millions; and we know in fact that the omissions have been very great. 


—Thomas Jefferson, letter to David Humphreys, August 23, 1791 


Sometimes, you would like to use stratification, unequal-probability sampling, or 
ratio estimation to increase the precision of your estimator, but the sampling frame 
lacks information on useful auxiliary variables. For example, suppose you want to 
sample businesses with probability proportional to income but do not have income 
information in the sampling frame. Or you want to estimate the total timber volume 
that has been cut in the forest by measuring the total volume in a sample of truckloads 
of logs. Timber volume in a truck is related to the weight of the truckload, so you 
would expect to gain precision by using ratio estimation with $y_i$ = timber volume in
truck $i$ and $x_i$ = weight of truck $i$. But the ratio estimator $\hat{t}_{yr} = t_x \hat{t}_y / \hat{t}_x$ requires that the
total weight for all truckloads be known, and weighing every truck in the population 
is impractical. 

Two-phase sampling, also called double sampling, provides a solution. Two- 
phase sampling, introduced by Neyman (1938), is useful when the variable of interest 
y is relatively expensive to measure, but a correlated variable x can be measured fairly 
easily and used to improve the precision of the estimator of $t_y$. It may also be used to
adjust for nonresponse, to sample rare populations, or to improve the sampling frame. 
We discuss some of these applications later in the chapter. 

Suppose the population has N observation units. The sample is taken in two 
phases: 


1 Phase I sample. Take a probability sample of $n^{(1)}$ units, called the phase I sample. 
Measure the auxiliary variables x for every unit in the phase I sample. In 
the survey of businesses, you could take a simple random sample (SRS) of tax 
records and record the reported income for each business in the sample. For 
measuring timber volume, you could weigh a sample of trucks selected either with 
an SRS or with probability proportional to estimated timber volume. The phase I 
sample is generally relatively large (and can be large because the auxiliary 
information is inexpensive to obtain), and should provide accurate information about 
the distribution of the x’s. 


2 Phase II sample. Now act as though the phase I sample is a population and select 
a probability sample of size $n^{(2)}$ from the phase I sample. Measure the variables of 
interest for each unit in the subsample, called the phase II sample. Since you are 
treating the phase I sample as the population from which the phase II sample is 
drawn, you may use the auxiliary information gathered in phase I when designing 
the phase II sample. You might select the businesses to be contacted with 
probability proportional to the income measured in the phase I sample. Alternatively, you 
might use the income information to stratify the businesses in the phase I sample 
and then contact a randomly selected subset of the businesses in each income 
stratum to obtain the desired information on variables such as total expenses. You 
could select the truckloads on which timber volume is to be measured with 
probability proportional to weight, or you could use the information in the phase I 
sample to obtain a better estimate of total weight and use ratio estimation. In each 
case, the y variables are relatively expensive to measure, but y is correlated with x. 


Two-phase sampling can save time and money if the auxiliary information is 
relatively inexpensive to obtain and if having that auxiliary information can increase 
the precision of the estimates for quantities of interest. 
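The two selection steps can be sketched in a few lines of Python. This is only an illustration under the simplest assumptions: both phases are SRSs, the function name `draw_two_phase_srs` is hypothetical, and the sizes 3000, 400, and 30 are made-up values chosen for the sketch.

```python
import random

def draw_two_phase_srs(N, n1, n2, seed=0):
    """Draw an SRS of n1 unit labels (phase I), then an SRS of n2 of them (phase II)."""
    rng = random.Random(seed)
    phase1 = rng.sample(range(N), n1)  # x would be measured on these units
    phase2 = rng.sample(phase1, n2)    # y would be measured only on these units
    return phase1, phase2

# Hypothetical sizes: 3000 units in the frame, 400 in phase I, 30 in phase II.
phase1, phase2 = draw_two_phase_srs(N=3000, n1=400, n2=30)
w1 = 3000 / len(phase1)         # phase I weight, 1 / P(unit is in phase I)
w2 = len(phase1) / len(phase2)  # phase II weight, conditional on phase I
print(len(phase1), len(phase2), round(w1 * w2, 6))
```

Because the phase II sample is drawn from the phase I sample, every phase II unit carries both weights; their product is the number of population units that each phase II unit represents.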


EXAMPLE 12.1 Stockford and Page (1984) used two-phase sampling to estimate the percentage of 
Vietnam-era veterans in U.S. Veterans Administration (VA) hospitals who actually 
served in Vietnam. 

The 1982 VA Annual Patient Census (APC) included a random sample of 20% 
of the patients in VA hospitals. The following question was included: “If period of 
service is ‘Vietnam era,’ was service in Vietnam?” with answer categories “yes,” 
“no,” and “not available.” The answers to the question were obtained from patients’ 
medical records. But the response from medical records could be inaccurate for several 
reasons: (1) The medical record classification was largely self-reported, and the patient 
may not have been able to recall the location of service due to medical problems, or 
may have been confused about the definition of Vietnam service (some pilots whose 
duty station was officially recorded as Thailand flew missions over Vietnam); (2) a 
patient might misstate Vietnam service because he or she thought the answer might 
affect VA benefits; or (3) errors might be made in recording the response in the medical 
record. In addition, a large number of patients had “not available” for the answer. Thus, 
the answer to the question on Vietnam service in the APC survey was unsatisfactory 
for estimating the percentage of Vietnam-era veterans in VA hospitals who served in 
Vietnam. 

Stockford and Page checked the military records for a stratified subsample of 
the hospitalized veterans to determine the true classification of Vietnam service. The 
information in the original survey was used for the stratification, as different 
percentages with Vietnam service were expected in the “yes,” “no,” and “not available” 
groups in the APC survey. Military records for all of the patients in the “not 
available” stratum were checked. It was expected that the within-stratum variances would 
be relatively low in the “yes” and “no” strata (even though the APC survey data 
are inaccurate, you would expect a higher percentage of “yes” respondents to have 
served in Vietnam than “no” respondents), and military records for a 10% subsample 
were checked for each of those two strata. 

The results for the question “Was service in Vietnam?” were as follows: 


APC Group        APC Survey        Subsample    Vietnam Service 
                 Classification    Size         in Subsample 
Yes                   755              67             49 
No                    804              72             11 
Not available         505             505            211 
Total                2064             644            271 


As expected, the percentage of veterans with Vietnam service differed for the 
three groups: Of the veterans with a “yes” response to the APC survey question, 73% 
actually served in Vietnam, compared with 15% for the “no” group and 42% for the 
veterans for which the information was not available. ■ 


Two-phase sampling is often used in forestry surveys. Aerial photographs are avail- 
able for the region of interest, and points are systematically distributed across the 
photographs. Areas around the points are inspected on the photographs and classified 
by land class: forest land, unproductive forest land, nonforest land, and water. A phase 
I sample of points is then drawn from the grid, with a higher sampling fraction for 
grid points classified as forest land than those classified as nonforest land. Areas in 
the phase I sample are examined more closely to classify them by stand size and 
density. Then, a subsample is taken of the points in the phase I sample, and ground 
measurements such as land use, volume, and mortality taken; the percentage of area 
that is forest from the phase II ground sample may differ somewhat from the photo 
estimate in phase I, and ratio estimation can be used in the phase II sample to increase 
the precision of the estimator. ■ 


We have already seen two-phase sampling used in nonresponse adjustment, in Sec- 
tion 8.3. A probability sample is taken from the population; the sampled units are then 
divided into the two strata of respondents and nonrespondents. Then a subsample is 
taken of the nonrespondents. The phase I sample is the original probability sample. 
The variable 

$$x_i = \begin{cases} 1 & \text{if observation } i \text{ responds} \\ 0 & \text{if observation } i \text{ is a nonrespondent} \end{cases}$$

is observed for everyone in the phase I sample. Then the information about $x_i$ is used 
in the phase II sample. The variable of interest $y_i$ is observed for all observations with 
$x_i = 1$; a subsample is taken for observations with $x_i = 0$. ■ 


What is the difference between a two-phase design, discussed in this chapter, and 
the two-stage designs discussed in Chapters 5 and 6? In two-phase sampling, the 
phase I sample is used to collect inexpensive auxiliary information on the sampling 
units; this information is then used to improve the efficiency of the phase II sample 
design. In two-stage sampling, different sizes of sampling units are collected at the 
two stages. All primary sampling units (psus) selected at stage 1 of a two-stage design 
are subsampled. In a two-phase design, it is possible that some psus sampled in the 
phase I sample will not be represented at all in the phase II sample. A two-stage design 
might have hospitals as psus at the first stage, and then subsample patients as ssus at 
the second stage. The final two-stage sample contains patients from each psu selected 
in the first stage. A two-phase sample of hospitals might take a probability sample of 
hospitals in phase I, then divide the hospitals in the phase I sample into strata based 
on number of heart attack patients. In phase II, a stratified random subsample of the 
hospitals would be selected using the stratification information from phase I. 


12.1 Theory for Two-Phase Sampling 


A general framework for two-phase sampling is given in Särndal and Swensson 
(1987) and Legg and Fuller (2009). Let $S^{(1)}$ denote the phase I sample; the units 
selected for the sample are determined by the random variables 

$$Z_i = \begin{cases} 1 & \text{if unit } i \text{ is in the phase I sample} \\ 0 & \text{if unit } i \text{ is not in the phase I sample.} \end{cases}$$

Let $w_i^{(1)}$, for $i \in S^{(1)}$, be the sampling weights for the phase I sample: $w_i^{(1)} = 
1/P(Z_i = 1)$. We observe a vector of auxiliary characteristics $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ 
for each observation unit in the phase I sample. Using the theory developed in earlier 
chapters, we can estimate the population total for auxiliary variable $j$ as 

$$\hat{t}_{x_j}^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} x_{ij} = \sum_{i=1}^{N} Z_i w_i^{(1)} x_{ij}.$$


Now, indicate membership in the phase II sample $S^{(2)}$ by the random variable 

$$D_i = \begin{cases} 1 & \text{if unit } i \text{ is in the phase II sample} \\ 0 & \text{if unit } i \text{ is not in the phase II sample.} \end{cases}$$

The probability that a unit is in the phase II sample depends on whether it is in the 
phase I sample and also may depend on auxiliary information collected in the phase 
I sample; we denote this dependence by writing $P(D_i = 1 \mid \mathbf{Z})$, where $\mathbf{Z}$ is the vector 
$(Z_1, Z_2, \ldots, Z_N)$. Thus, when we find an expectation conditional on $\mathbf{Z}$, we are treating 
the information from the phase I sample as known. The subsampling weights for the 
final, phase II sample also depend on which units were selected to be in the phase I 
sample: 

$$w_i^{(2)} = w_i^{(2)}(\mathbf{Z}) = \begin{cases} \dfrac{1}{P(D_i = 1 \mid \mathbf{Z})} & \text{if } Z_i = 1 \\ 0 & \text{if } Z_i = 0. \end{cases}$$




An analog of the Horvitz-Thompson estimator for two-phase sampling is 

$$\hat{t}_y^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} y_i = \sum_{i=1}^{N} Z_i D_i w_i^{(1)} w_i^{(2)} y_i. \qquad (12.1)$$

Kott and Stukel (1997) call (12.1) the double expansion estimator; it “expands” the 
weight on $y_i$ by the product of the two sampling weights. 
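We use the following device to find properties of the estimator in (12.1). The phase 
II sample is selected by treating the phase I sample as the population, so we can find 
properties of the subsample relative to phase I using standard methods. Define 

$$\hat{t}_y^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} y_i = \sum_{i=1}^{N} Z_i w_i^{(1)} y_i.$$

Now, we do not know what $\hat{t}_y^{(1)}$ is, because we only observe the $y_i$’s in the phase II 
sample. But $\hat{t}_y^{(1)}$ serves as the “population total” estimated in phase II: if we knew $y_i$ 
for all units in the phase I sample, we would estimate $t_y$ by $\hat{t}_y^{(1)}$. Treating the phase I 
sample as known, and noting that $w_i^{(2)} E[D_i \mid \mathbf{Z}] = 1$ whenever $Z_i = 1$, we have 

$$E[\hat{t}_y^{(2)} \mid \mathbf{Z}] = \sum_{i=1}^{N} Z_i w_i^{(1)} w_i^{(2)} y_i E[D_i \mid \mathbf{Z}] = \sum_{i=1}^{N} Z_i w_i^{(1)} y_i = \hat{t}_y^{(1)}.$$

Then, using successive conditioning (see Section A.4), 

$$E[\hat{t}_y^{(2)}] = E\{E[\hat{t}_y^{(2)} \mid \mathbf{Z}]\} = E[\hat{t}_y^{(1)}] = t_y. \qquad (12.2)$$

Also, from Property A.4 in Section A.4, 

$$V(\hat{t}_y^{(2)}) = V(E[\hat{t}_y^{(2)} \mid \mathbf{Z}]) + E(V[\hat{t}_y^{(2)} \mid \mathbf{Z}]) = V(\hat{t}_y^{(1)}) + E(V[\hat{t}_y^{(2)} \mid \mathbf{Z}]). \qquad (12.3)$$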


The first term is the variance that would be obtained if $y_i$ had been observed for every 
observation in $S^{(1)}$; the second term is the additional variance from subsampling in 
phase II. Consequently, the variance from two-phase sampling is always larger than if 
we measured y on every unit in the phase I sample of $n^{(1)}$ units. We hope, though, that 
if y is related to x, the second term in (12.3) will be smaller than the variance of an 
estimator of $t_y$ from a sample of size $n^{(2)}$ that does not use the auxiliary information 
in the design. 
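A small enumeration can make the unbiasedness result in (12.2) concrete. The sketch below assumes an SRS at both phases, so that every unit has $w_i^{(1)} = N/n^{(1)}$ and $w_i^{(2)} = n^{(1)}/n^{(2)}$; the four population values and the function name `double_expansion_total` are hypothetical, chosen only so that every possible two-phase sample can be listed.

```python
from itertools import combinations

def double_expansion_total(y_phase2, w1, w2):
    """Estimator (12.1) when every unit has the same weights w1 and w2."""
    return sum(w1 * w2 * y for y in y_phase2)

# Hypothetical population of N = 4 units; phase I SRS of 3, phase II SRS of 2.
y = [3.0, 7.0, 8.0, 12.0]
N, n1, n2 = 4, 3, 2

# Every (phase I, phase II) pair is equally likely under SRS at both phases.
estimates = [
    double_expansion_total([y[i] for i in s2], N / n1, n1 / n2)
    for s1 in combinations(range(N), n1)
    for s2 in combinations(s1, n2)
]
mean_estimate = sum(estimates) / len(estimates)
print(len(estimates), mean_estimate, sum(y))
```

Averaging the estimator over all equally likely two-phase samples reproduces the population total, which is the statement $E[\hat{t}_y^{(2)}] = t_y$ for this toy design.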


12.2 Two-Phase Sampling with Stratification 


In two-phase sampling with stratification, information on a stratification variable is 
collected in phase I. That information is then used to select a stratified sample (the 
phase II sample) from the phase I sample. For simplicity, assume that an SRS is taken 
in phase I, and that stratified random sampling is used in phase II. Särndal et al. (1992, 
Chapter 9) give a more general treatment, allowing unequal-probability sampling for 
either phase. Define $S^{(1)}$, $S^{(2)}$, $Z_i$, and $D_i$ as in Section 12.1. If an SRS of size $n^{(1)}$ is 
taken in phase I, then $P(Z_i = 1) = n^{(1)}/N$. 

The observation units are divided among H strata, but we do not know stratum 
membership for a unit until it is selected in phase I. In the population, however, stratum 
h has $N_h$ units (assume $N_h$ is unknown) and $N = \sum_{h=1}^{H} N_h$ (assume N is known). Let 

$$x_{ih} = \begin{cases} 1 & \text{if unit } i \text{ is in stratum } h \\ 0 & \text{if unit } i \text{ is not in stratum } h. \end{cases}$$


Observe $x_{ih}$, $h = 1, \ldots, H$, for each unit in the phase I sample. The number of 
units in the phase I sample that belong to stratum h is a random variable: 

$$n_h = \sum_{i=1}^{N} Z_i x_{ih}.$$


Now take a simple random subsample of size $m_h$ in stratum h; $m_h$ may depend on the 
first phase of the sampling. The subsamples in different strata are selected 
independently, given the information in the phase I sample. With random subsampling, 

$$P(D_i = 1 \mid \mathbf{Z}) = Z_i \sum_{h=1}^{H} x_{ih} \frac{m_h}{n_h}.$$

Although $P(D_i = 1 \mid \mathbf{Z})$ is written as a sum, all but one of the $x_{ih}$’s ($h = 1, \ldots, H$) will 
equal zero because each unit belongs to exactly one stratum, so that $P(D_i = 1 \mid \mathbf{Z}) = 
Z_i m_h / n_h$ for unit i determined to be in stratum h. The sampling weight for a phase II 
unit in stratum h is $w_i^{(2)} = n_h/m_h$; in general, $w_i^{(2)} = Z_i \sum_{h=1}^{H} x_{ih} n_h / m_h$. 

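The two-phase-sampling stratified estimator of the population total is 

$$\begin{aligned} \hat{t}_{str}^{(2)} &= \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} y_i \\ &= \sum_{i=1}^{N} Z_i D_i \frac{N}{n^{(1)}} \left( \sum_{h=1}^{H} x_{ih} \frac{n_h}{m_h} \right) y_i = N \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \bar{y}_h^{(2)}, \end{aligned} \qquad (12.4)$$

where $\bar{y}_h^{(2)} = \sum_{i \in S^{(2)}} x_{ih} y_i / m_h$ is the average of the phase II units in stratum h. We 
showed in (12.2) that $E[\hat{t}_{str}^{(2)}] = t_y$. The corresponding estimator of the population 
mean is 

$$\hat{\bar{y}}_{str}^{(2)} = \frac{\hat{t}_{str}^{(2)}}{N} = \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \bar{y}_h^{(2)}. \qquad (12.5)$$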

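Recall that a stratified random sampling estimator of the population total from 
(3.1) is 

$$\hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = N \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h;$$

the two-phase-sampling estimator simply substitutes $n_h/n^{(1)}$ for $N_h/N$. 

The variance is also computed conditionally using (12.3): 

$$\begin{aligned} V(\hat{t}_{str}^{(2)}) &= V(E[\hat{t}_{str}^{(2)} \mid \mathbf{Z}]) + E(V[\hat{t}_{str}^{(2)} \mid \mathbf{Z}]) \\ &= V(\hat{t}_y^{(1)}) + N^2 E\left( V\left[ \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \bar{y}_h^{(2)} \,\Big|\, \mathbf{Z} \right] \right) \\ &= N^2 \left( 1 - \frac{n^{(1)}}{N} \right) \frac{S_y^2}{n^{(1)}} + N^2 E\left[ \sum_{h=1}^{H} \left( \frac{n_h}{n^{(1)}} \right)^2 \left( 1 - \frac{m_h}{n_h} \right) \frac{s_h^{2(1)}}{m_h} \right]. \end{aligned} \qquad (12.6)$$

The first term is the variance from the phase I SRS; the second term is the additional 
variance resulting from the subsampling in phase II. Here, $S_y^2 = \sum_{i=1}^{N} (y_i - \bar{y}_U)^2 / 
(N - 1)$ is the population variance of the y’s; 

$$s_h^{2(1)} = \frac{1}{n_h - 1} \sum_{i \in S^{(1)}} x_{ih} \left( y_i - \bar{y}_h^{(1)} \right)^2$$

would be the sample variance of the $y_i$’s in stratum h in the phase I sample if we 
observed all of them. The second term in (12.6) is left as an expectation because $n_h$ 
and $m_h$ are random variables. 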

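Rao (1973) estimates the variance in two-phase sampling with stratification as 

$$\begin{aligned} \hat{V}(\hat{t}_{str}^{(2)}) = {} & N(N-1) \sum_{h=1}^{H} \left( \frac{n_h - 1}{n^{(1)} - 1} - \frac{m_h - 1}{N - 1} \right) \frac{n_h}{n^{(1)}} \frac{s_h^{2(2)}}{m_h} \\ & + \frac{N^2}{n^{(1)} - 1} \left( 1 - \frac{n^{(1)}}{N} \right) \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \left( \bar{y}_h^{(2)} - \hat{\bar{y}}_{str}^{(2)} \right)^2, \end{aligned} \qquad (12.7)$$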


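where 

$$s_h^{2(2)} = \frac{1}{m_h - 1} \sum_{i \in S^{(2)}} x_{ih} \left( y_i - \bar{y}_h^{(2)} \right)^2$$

is the sample variance of the $y_i$’s in stratum h (see Exercise 11). If we can ignore the 
finite population corrections (fpcs), 

$$\hat{V}(\hat{\bar{y}}_{str}^{(2)}) \approx \sum_{h=1}^{H} \frac{n_h - 1}{n^{(1)} - 1} \frac{n_h}{n^{(1)}} \frac{s_h^{2(2)}}{m_h} + \frac{1}{n^{(1)} - 1} \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \left( \bar{y}_h^{(2)} - \hat{\bar{y}}_{str}^{(2)} \right)^2. \qquad (12.8)$$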
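EXAMPLE 12.4 Let’s apply these results to the data in Example 12.1. Because $\bar{y}_h^{(2)} = \hat{p}_h$ is a proportion, 
$s_h^{2(2)} = m_h \hat{p}_h (1 - \hat{p}_h)/(m_h - 1)$. The statistics from the phase II sample are as follows: 

Stratum          $n_h$    $m_h$    $\hat{p}_h$    $s_h^{2(2)}$ 
Yes               755      67     0.7313     0.1995 
No                804      72     0.1528     0.1313 
Not available     505     505     0.4178     0.2437 
Total            2064     644 

The estimated percentage of Vietnam-era VA hospital patients who served in Vietnam 
is, from (12.5), 

$$\hat{p}_{str}^{(2)} = \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \hat{p}_h = \frac{755}{2064}(0.7313) + \frac{804}{2064}(0.1528) + \frac{505}{2064}(0.4178) = 0.429.$$

The phase I sample is an SRS with $n^{(1)}/N = 0.2$, so the fpc should be included in the 
variance estimate. Calculating the terms in (12.7), each divided by $N^2$ because we are 
estimating a proportion rather than a total, 

$$\frac{N-1}{N} \sum_{h=1}^{H} \left( \frac{n_h - 1}{n^{(1)} - 1} - \frac{m_h - 1}{N - 1} \right) \frac{n_h}{n^{(1)}} \frac{s_h^{2(2)}}{m_h} = 0.000391 + 0.000271 + 0.0000231 = 0.000686,$$

$$\frac{1}{n^{(1)} - 1} \left( 1 - \frac{n^{(1)}}{N} \right) \sum_{h=1}^{H} \frac{n_h}{n^{(1)}} \left( \hat{p}_h - \hat{p}_{str}^{(2)} \right)^2 = 1.29 \times 10^{-5} + 1.16 \times 10^{-5} + 1.24 \times 10^{-8} = 0.0000245.$$

Thus, $\hat{V}(\hat{p}_{str}^{(2)}) = 0.000686 + 0.0000245 = 0.00071$, and $\text{SE}(\hat{p}_{str}^{(2)}) = 0.027$. 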

Was two-phase sampling more efficient here? Had an SRS of size 644 been taken 
directly from the records, and had p = 0.429 been observed, the standard error would 
have been SE(p) = 0.019, which is actually smaller than the standard error from 
the two-phase sampling design. If you look at the individual terms in the variance 
estimates, you can see why two-phase sampling did not increase efficiency in this 
example. All of the phase I units in the “not available” stratum were subsampled, 
giving a very low value of $s_h^{2(2)}/m_h$ for that stratum. But the sample sizes in the 
other two strata were too small, leading to relatively large contributions to the overall 
variance from those two strata. 

Suppose proportional allocation had been used in the phase II sample instead 
and that the same sample proportions had been observed. Then, you would 
subsample 236 records in the “yes” stratum, 251 records in the “no” stratum, and 
157 records in the “not available” stratum. In that case, if the sample proportions 
remained the same, the standard error from the two-phase sample would have been 
0.017, a modest decrease from the standard error of an SRS of size 644. But pro- 
portional allocation does not make the most efficient use of the phase I information. 
More savings would have been achieved if some sort of optimal allocation had been 
used (see Exercise 9). ■ 
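The calculations in Example 12.4 can be reproduced with a short script. The sketch below is not the SAS code from the book's website; it is a plain Python rendering of (12.5) and of (12.7) divided by $N^2$, with hypothetical function names. It assumes N = 10,320, which follows from treating the phase I sample of 2064 as exactly 20% of the population, and it recomputes $\hat{p}_h$ from the subsample counts in Example 12.1.

```python
import math

def two_phase_stratified_mean(n_h, ybar_h):
    """Estimator (12.5): stratum means weighted by phase I shares n_h / n1."""
    n1 = sum(n_h)
    return sum(n / n1 * yb for n, yb in zip(n_h, ybar_h))

def rao_variance_of_mean(n_h, m_h, ybar_h, s2_h, N):
    """Variance estimator (12.7), divided by N^2 so that it applies to the mean."""
    n1 = sum(n_h)
    ybar = two_phase_stratified_mean(n_h, ybar_h)
    within = sum(((n - 1) / (n1 - 1) - (m - 1) / (N - 1)) * (n / n1) * s2 / m
                 for n, m, s2 in zip(n_h, m_h, s2_h))
    between = sum((n / n1) * (yb - ybar) ** 2 for n, yb in zip(n_h, ybar_h))
    return (N - 1) / N * within + (1 - n1 / N) / (n1 - 1) * between

n_h = [755, 804, 505]      # phase I counts: yes, no, not available
m_h = [67, 72, 505]        # phase II subsample sizes
served = [49, 11, 211]     # subsampled veterans who served in Vietnam
N = 10320                  # assumption: phase I is exactly 20% of the population

p_h = [a / m for a, m in zip(served, m_h)]
s2_h = [m * p * (1 - p) / (m - 1) for m, p in zip(m_h, p_h)]
p_str = two_phase_stratified_mean(n_h, p_h)
se = math.sqrt(rao_variance_of_mean(n_h, m_h, p_h, s2_h, N))

# Proportional allocation with the same observed proportions.
m_prop = [236, 251, 157]
s2_prop = [m * p * (1 - p) / (m - 1) for m, p in zip(m_prop, p_h)]
se_prop = math.sqrt(rao_variance_of_mean(n_h, m_prop, p_h, s2_prop, N))
print(round(p_str, 3), round(se, 3), round(se_prop, 3))
```

Changing only `m_h` while holding `p_h` fixed, as in the last lines, is the same thought experiment the example uses to compare allocations.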


Ideally, you would use the information about stratum membership from phase I to 
have a more efficient sampling design in phase II. This usually means using optimal 
allocation in the stratified phase II sample. For example, in a survey to study total 
sales of manufacturing firms, you might obtain total revenue from the tax records 
(x) for a sample of manufacturing firms. Then you could use that tax information to 
stratify the phase I sample by the reported revenue, and take higher sampling fractions 
in phase II for the strata with higher revenue in the tax records. 

A screening survey is a special case of a two-phase sample using stratification. 
The U.S. National Immunization Survey (Smith et al., 2005) collects a phase I sample 
using random digit dialing. The households in the phase I sample are divided into two 
strata: (1) households with children 19-35 months old, and (2) households with no 
children between 19 and 35 months old. Since the goal of the survey is to estimate 
vaccination rates for children in the 19-35 month age group, no households in stratum 
2 are included in the phase II sample. The parent or guardian in stratum 1 households is 
asked to consent for information to be obtained from the child’s vaccination providers, 
and those providers are asked about the child’s immunizations. Nonresponse and other 
nonsampling errors in the phase II sample require weighting adjustments. 


McNamee (2003) discusses the use of two-phase sampling to estimate disease preva- 
lence. In the first phase an inexpensive, but not completely accurate, method is used 
to classify persons as having the disease or not. The second phase is a more accurate 
test for the disease. For example, the phase I survey might ask people whether they 
have diabetes, and divide the respondents into stratum 1, persons who say they have 
diabetes, and stratum 2, persons who say they do not have diabetes. But some persons 
with diabetes are unaware that they have it. You therefore need to subsample both 
strata in the phase II sample, which evaluates persons through a medical examination, 
to guarantee that diabetics who are unaware they have diabetes can be included in 
the sample. Although we expect a smaller fraction of the persons in stratum 2 to have 
diabetes, compared with the fraction in stratum 1 who have diabetes, the 
characteristics of persons with diabetes in stratum 2 might be quite different from those in 
stratum 1. 


12.3 Ratio and Regression Estimation in Two-Phase Samples 


The stratified two-phase sampling design in Section 12.2 uses the auxiliary 
information collected from the phase I sample in the design of the phase II sample. 
Alternatively, or in addition, the information about the auxiliary variables x can be used in 
the estimator, through ratio and regression estimation. 


12.3.1 Two-Phase Sampling with Ratio Estimation 


Suppose that x, a variable thought to be highly correlated with y, can be measured 
inexpensively in the phase I sample. Define $S^{(1)}$, $S^{(2)}$, $Z_i$, and $D_i$ as in Section 12.1. 
The auxiliary variable $x_i$ is measured for each observation in the phase I sample; from 
that sample, we may estimate the population total $t_x = \sum_{i=1}^{N} x_i$ by 

$$\hat{t}_x^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} x_i = \sum_{i=1}^{N} Z_i w_i^{(1)} x_i.$$

Now select the phase II subsample and measure $y_i$ on units in the subsample. From 
the phase II sample $S^{(2)}$, we can calculate $\hat{t}_y^{(2)}$ using (12.1) and 

$$\hat{t}_x^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} x_i = \sum_{i=1}^{N} Z_i D_i w_i^{(1)} w_i^{(2)} x_i.$$


Then, 

$$\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)} \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}} = \hat{t}_x^{(1)} \hat{B}^{(2)}. \qquad (12.9)$$

Note that this estimator is very similar to the ratio estimator in (4.2); we use $\hat{t}_x^{(1)}$ from 
the phase I sample instead of the unknown quantity $t_x$. 
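Using linearization, 

$$\begin{aligned} \hat{t}_{yr}^{(2)} &\approx t_y + \frac{t_y}{t_x} (\hat{t}_x^{(1)} - t_x) + \frac{t_x}{t_x} (\hat{t}_y^{(2)} - t_y) - \frac{t_y t_x}{t_x^2} (\hat{t}_x^{(2)} - t_x) \\ &= \hat{t}_y^{(2)} + \frac{t_y}{t_x} (\hat{t}_x^{(1)} - \hat{t}_x^{(2)}). \end{aligned}$$

Then, 

$$\begin{aligned} V(\hat{t}_{yr}^{(2)}) &\approx V\left[ \hat{t}_y^{(2)} + \frac{t_y}{t_x} (\hat{t}_x^{(1)} - \hat{t}_x^{(2)}) \right] \\ &= V\left\{ E\left[ \hat{t}_y^{(2)} + \frac{t_y}{t_x} (\hat{t}_x^{(1)} - \hat{t}_x^{(2)}) \,\Big|\, \mathbf{Z} \right] \right\} + E\left\{ V\left[ \hat{t}_y^{(2)} + \frac{t_y}{t_x} (\hat{t}_x^{(1)} - \hat{t}_x^{(2)}) \,\Big|\, \mathbf{Z} \right] \right\} \\ &= V(\hat{t}_y^{(1)}) + E\left[ V\left( \hat{t}_y^{(2)} - \frac{t_y}{t_x} \hat{t}_x^{(2)} \,\Big|\, \mathbf{Z} \right) \right] \\ &= V(\hat{t}_y^{(1)}) + E\left[ V(\hat{t}_d^{(2)} \mid \mathbf{Z}) \right], \end{aligned}$$

where $d_i = y_i - (t_y/t_x) x_i$. Thus, the variance of the two-phase ratio estimator is 
the variance that would be calculated for $\hat{t}_y^{(1)}$ if we observed $y_i$ for every unit in the 
phase I sample, plus an extra term involving the variance of the residuals from the 
ratio model. 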


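If an SRS of $n^{(1)}$ units is taken for phase I and an SRS of $n^{(2)}$ units is taken in 
phase II, then 

$$V(\hat{t}_{yr}^{(2)}) \approx N^2 \left( 1 - \frac{n^{(1)}}{N} \right) \frac{S_y^2}{n^{(1)}} + N^2 \left( 1 - \frac{n^{(2)}}{n^{(1)}} \right) \frac{S_d^2}{n^{(2)}}, \qquad (12.10)$$

where $d_i = y_i - B x_i$ and $S_d^2 = \sum_{i=1}^{N} d_i^2 / (N - 1)$, and 

$$\hat{V}(\hat{t}_{yr}^{(2)}) = N^2 \left( 1 - \frac{n^{(1)}}{N} \right) \frac{s_y^2}{n^{(1)}} + N^2 \left( 1 - \frac{n^{(2)}}{n^{(1)}} \right) \frac{s_e^2}{n^{(2)}}, \qquad (12.11)$$

where $s_y^2 = \sum_{i \in S^{(2)}} (y_i - \bar{y}^{(2)})^2 / (n^{(2)} - 1)$ and $s_e^2 = \sum_{i \in S^{(2)}} (y_i - \hat{B}^{(2)} x_i)^2 / (n^{(2)} - 1)$, is 
an approximately unbiased estimator of $V(\hat{t}_{yr}^{(2)})$ (see Exercise 12). Another estimator 
of the variance is given in Exercise 13. 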


EXAMPLE 12.6 Suppose, for the population in agpop.dat sampled in Examples 2.5 and 4.2, that 
we do not know the value of $x_i$ = acres87, the acreage devoted to farms in 1987, 
or of $t_x$ before sampling. To use x as auxiliary information through two-phase ratio 
estimation, we take an SRS of size 400, measure acres87 on every unit in this phase I 
sample, and then take an SRS of size 30 from the phase I sample to serve as the phase 
II sample. We measure y = acres92 on the units in the phase II sample. We then 
employ ratio estimation to estimate $t_y$, using $\hat{t}_x^{(1)}$ to estimate the unknown auxiliary 
population total $t_x$. Using the SAS code on the website, we calculate 

$$\hat{t}_x^{(1)} = 1{,}002{,}814{,}347, \qquad \hat{B}^{(2)} = \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}} = \frac{322{,}385}{335{,}444},$$

$$\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)} \hat{B}^{(2)} = 963{,}774{,}784.$$

From (12.11), 

$$\begin{aligned} \hat{V}(\hat{t}_{yr}^{(2)}) &= N^2 \left( 1 - \frac{n^{(1)}}{N} \right) \frac{s_y^2}{n^{(1)}} + N^2 \left( 1 - \frac{n^{(2)}}{n^{(1)}} \right) \frac{s_e^2}{n^{(2)}} \\ &= 3000^2 \left( 1 - \frac{400}{3000} \right) \frac{112{,}160{,}218{,}976}{400} + 3000^2 \left( 1 - \frac{30}{400} \right) \frac{1{,}908{,}426{,}448}{30} \\ &= 2.7 \times 10^{15}. \end{aligned}$$

Thus, an approximate 95% confidence interval (CI) for $t_y$ is $963{,}774{,}784 \pm 2.05 \times 
\sqrt{2.7 \times 10^{15}}$ = [856,924,072, 1,070,625,496]. The corresponding CI for $\bar{y}_U$ is 
[285,641, 356,875] with $\hat{\bar{y}}_r^{(2)}$ = 321,258. The widths of these CIs are comparable 
to those of the SRS of size 300 in Example 2.10, even though here $y_i$ was measured 
only on the 30 units in the phase II sample. If measuring x is inexpensive relative to 
measuring y, the high correlation between x and y makes two-phase sampling with 
ratio estimation very efficient. ■ 
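The arithmetic in Example 12.6 can be checked from the summary statistics alone. The following sketch plugs the quantities reported in the example into (12.9) and (12.11); it is not the SAS code from the website, the variable names are hypothetical, and N = 3000 is taken from the fpc used in the example's variance calculation.

```python
import math

# Summary statistics as reported in Example 12.6.
N, n1, n2 = 3000, 400, 30
t_x1 = 1_002_814_347           # phase I estimate of the acres87 total
ratio_num, ratio_den = 322_385, 335_444   # phase II ratio, y over x
s2_y = 112_160_218_976         # phase II sample variance of y
s2_e = 1_908_426_448           # phase II variance of ratio residuals

B2 = ratio_num / ratio_den                # estimated ratio from phase II
t_yr = t_x1 * B2                          # estimator (12.9)
v_hat = (N**2 * (1 - n1 / N) * s2_y / n1
         + N**2 * (1 - n2 / n1) * s2_e / n2)   # estimator (12.11)
se = math.sqrt(v_hat)
ci = (t_yr - 2.05 * se, t_yr + 2.05 * se)      # 2.05 is the multiplier used in the text
print(round(t_yr), f"{v_hat:.2e}", [round(c) for c in ci])
```

Small differences from the printed values are expected because the inputs here are rounded. Comparing the two pieces of `v_hat` shows how much of the variance comes from phase I and how much from the residuals in phase II.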


12.3.2 Generalized Regression Estimation in Two-Phase Sampling 
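We can also use a two-phase version of the generalized regression (GREG) estimator 
of Section 11.7. The two-phase GREG estimator takes the form 

$$\hat{t}_{y,\text{GREG}}^{(2)} = \hat{t}_y^{(2)} + \left( \hat{\mathbf{t}}_x^{(1)} - \hat{\mathbf{t}}_x^{(2)} \right)^T \hat{\mathbf{B}}^{(2)}, \qquad (12.12)$$

where 

$$\hat{\mathbf{B}}^{(2)} = \left( \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} \frac{1}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} \frac{1}{\sigma_i^2} \mathbf{x}_i y_i, \qquad (12.13)$$

and the constants $\sigma_i^2$ are determined by the analyst. In (12.12), the estimator is 
calibrated to the estimated population totals of x from phase I, $\hat{\mathbf{t}}_x^{(1)}$ (see Exercise 15). 

The estimator in (12.12) may be written using a modification of the weights. 
Analogously to (11.22), let 

$$g_i = 1 + \left( \hat{\mathbf{t}}_x^{(1)} - \hat{\mathbf{t}}_x^{(2)} \right)^T \left( \sum_{j \in S^{(2)}} w_j^{(1)} w_j^{(2)} \frac{1}{\sigma_j^2} \mathbf{x}_j \mathbf{x}_j^T \right)^{-1} \frac{1}{\sigma_i^2} \mathbf{x}_i.$$

Then, 

$$\hat{t}_{y,\text{GREG}}^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} g_i y_i.$$

We again use Property A.4 in Section A.4 to find $V(\hat{t}_{y,\text{GREG}}^{(2)})$: 

$$V\left( \hat{t}_{y,\text{GREG}}^{(2)} \right) = V\left[ E\left( \hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z} \right) \right] + E\left[ V\left( \hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z} \right) \right]. \qquad (12.14)$$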
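Since the GREG estimator is approximately unbiased, if the sizes of the phase I and 
phase II samples are sufficiently large, then 

$$V\left[ E\left( \hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z} \right) \right] \approx V(\hat{t}_y^{(1)}),$$

where $V(\hat{t}_y^{(1)})$ is the variance of the estimator $\hat{t}_y^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} y_i$ of the population 
total we would have if we had been able to measure y on every unit in the phase I 
sample. By linearization, the conditional variance in the second term of (12.14) is 

$$V\left( \hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z} \right) \approx V\left( \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} d_i^{(1)} \,\Big|\, \mathbf{Z} \right),$$

where $d_i^{(1)} = y_i - \mathbf{x}_i^T \mathbf{B}^{(1)}$ and $\mathbf{B}^{(1)} = \left( \sum_{i \in S^{(1)}} w_i^{(1)} \mathbf{x}_i \mathbf{x}_i^T / \sigma_i^2 \right)^{-1} \sum_{i \in S^{(1)}} w_i^{(1)} \mathbf{x}_i y_i / \sigma_i^2$. 
Särndal et al. (1992, Chapter 9) estimate the two terms in (12.14) separately. The 
conditional variance in the second term is unbiased for its expectation, so $E[V(\hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z})]$ 
may be estimated by 

$$\hat{V}\left( \hat{t}_{y,\text{GREG}}^{(2)} \mid \mathbf{Z} \right) = \hat{V}\left( \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} e_i \,\Big|\, \mathbf{Z} \right),$$

with $e_i = y_i - \mathbf{x}_i^T \hat{\mathbf{B}}^{(2)}$ substituted for the unknown values $d_i^{(1)}$. If the phase I sampling 
design is an SRS of $n^{(1)}$ units, the first term in (12.14) may be estimated by 

$$\hat{V}(\hat{t}_y^{(1)}) = N^2 \left( 1 - \frac{n^{(1)}}{N} \right) \frac{s_y^2}{n^{(1)}},$$

with $s_y^2$ an estimator of the population variance $S_y^2$ computed from the phase II sample. 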


Estimating the first term in (12.14) is more challenging for complex designs; Exer- 
cise 16 presents an estimator for this situation. 
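To see the calibration property of (12.12) at work, consider the following sketch. It assumes the simplest nontrivial case, $\mathbf{x}_i = (1, x_i)^T$ with $\sigma_i^2 = 1$ for every unit, and solves the resulting 2-by-2 system by hand; the data, the weights, and the function names `solve_2x2` and `two_phase_greg` are all hypothetical.

```python
def solve_2x2(a, b):
    """Solve a z = b for a 2x2 matrix a (nested lists) and a length-2 vector b."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    z0 = (a[1][1] * b[0] - a[0][1] * b[1]) / det
    z1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [z0, z1]

def two_phase_greg(x_phase1, w1, x_phase2, y_phase2, w12):
    """Two-phase GREG total with x_i = (1, x_i) and sigma_i^2 = 1.

    w1 holds phase I weights; w12 holds the products w1 * w2 for phase II units.
    """
    tx1 = [sum(w1), sum(w * x for w, x in zip(w1, x_phase1))]
    tx2 = [sum(w12), sum(w * x for w, x in zip(w12, x_phase2))]
    ty2 = sum(w * y for w, y in zip(w12, y_phase2))
    m = [[tx2[0], tx2[1]],
         [tx2[1], sum(w * x * x for w, x in zip(w12, x_phase2))]]
    r = [ty2, sum(w * x * y for w, x, y in zip(w12, x_phase2, y_phase2))]
    b = solve_2x2(m, r)                       # coefficient vector as in (12.13)
    diff = [tx1[0] - tx2[0], tx1[1] - tx2[1]]
    lam = solve_2x2(m, diff)                  # m is symmetric, so g_i = 1 + lam . x_i
    g = [1 + lam[0] + lam[1] * x for x in x_phase2]
    t_greg = ty2 + diff[0] * b[0] + diff[1] * b[1]   # estimator (12.12)
    return t_greg, b, g

# Hypothetical data: N = 60, phase I SRS of 6, phase II SRS of 3, y = 1 + 2x exactly.
x_phase1 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
w1 = [10.0] * 6
x_phase2 = [2.0, 6.0, 12.0]
y_phase2 = [5.0, 13.0, 25.0]
w12 = [20.0] * 3

t_greg, b, g = two_phase_greg(x_phase1, w1, x_phase2, y_phase2, w12)
t_from_g = sum(w * gi * y for w, gi, y in zip(w12, g, y_phase2))
x_total_from_g = sum(w * gi * x for w, gi, x in zip(w12, g, x_phase2))
print(t_greg, t_from_g, x_total_from_g)
```

With the g-weights applied, the phase II sample reproduces the phase I estimate of the x total, and the weighted sum of the $y_i$ equals the GREG estimate from (12.12). Because y is an exact linear function of x in this toy data set, the estimate also equals the total that would have been estimated had y been observed on all six phase I units.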


Barnett et al. (2001) describe an application of two-phase sampling to an account- 
ing problem. The auditor has access to a large phase I SRS of transactions. Each 
transaction in the phase I sample has been checked by internal auditors, who record 
errors they find. A small random subsample is taken of the phase I transactions; each 
transaction in the phase II SRS is examined by an external auditor. If the internal 
and external auditor disagree on a transaction, the external auditor is assumed to be 
correct. If the internal auditors are largely correct, then regression estimation can be 
used, either through ratio estimation or poststratification into classes based on types 
of errors, to greatly increase the precision of the amounts of errors in the population 
of transactions. 


12.4 Jackknife Variance Estimation for Two-Phase Sampling 


As we have seen, the formulas for variance estimators are complicated in two-phase 
sampling. Fortunately, in many cases we can use resampling methods to estimate 
variances. In this section, we describe jackknife variance estimators for two-phase 
sampling, studied by Rao and Shao (1992), Rao and Sitter (1995, 1997), Kott and 
Stukel (1997), Sitter (1997), and Kim et al. (2006). The jackknife method presented 
in Section 9.3.2 needs to be modified for two-phase sampling because y is observed 
only for units selected in phase II. We mimic the sampling design, including the two 
phases of sampling, in the resamples. The jackknife estimates the with-replacement 
variance; it is a good approximation to the without-replacement variance if the sam- 
pling fractions are small. 

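Suppose the phase I design is an SRS of size $n^{(1)}$ and the phase II design is an 
SRS of size $n^{(2)}$. Consider the ratio estimator of $t_y$ in Section 12.3.1, 

$$\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)} \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}}.$$

When we delete unit j in the phase I sample, we obtain 

$$\hat{t}_{yr(j)}^{(2)} = \hat{t}_{x(j)}^{(1)} \frac{\hat{t}_{y(j)}^{(2)}}{\hat{t}_{x(j)}^{(2)}},$$

where $\hat{t}_{x(j)}^{(1)}$, $\hat{t}_{y(j)}^{(2)}$, and $\hat{t}_{x(j)}^{(2)}$ are calculated using the jackknife weights as described in the 
next paragraph. Then, 

$$\hat{V}_{JK}(\hat{t}_{yr}^{(2)}) = \frac{n^{(1)} - 1}{n^{(1)}} \sum_{j \in S^{(1)}} \left[ \hat{t}_{yr(j)}^{(2)} - \hat{t}_{yr}^{(2)} \right]^2.$$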

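When both samples are SRSs, $w_i^{(1)} = N/n^{(1)}$ for the phase I sample and $w_i^{(2)} = 
n^{(1)}/n^{(2)}$ for the phase II sample. The jackknife weights for phase I are constructed 
exactly like those in Section 9.3.2. When unit j from the phase I sample is deleted, 
the modified weight for phase I is 

$$w_{i(j)}^{(1)} = \begin{cases} 0 & \text{if } i = j \\ \dfrac{n^{(1)}}{n^{(1)} - 1} w_i^{(1)} = \dfrac{N}{n^{(1)} - 1} & \text{if } i \ne j. \end{cases}$$

The modified weight for phase II depends on whether the unit deleted from the phase 
I sample is in the phase II sample or not: 

$$w_{i(j)}^{(2)} = \begin{cases} 0 & \text{if } i = j \text{ and } j \in S^{(2)} \\ \dfrac{n^{(1)} - 1}{n^{(2)} - 1} & \text{if } i \ne j \text{ and } j \in S^{(2)} \\ \dfrac{n^{(1)} - 1}{n^{(2)}} & \text{if } j \notin S^{(2)}. \end{cases}$$

Using the jackknife weights, 

$$\hat{t}_{x(j)}^{(1)} = \sum_{i \in S^{(1)}} w_{i(j)}^{(1)} x_i, \qquad \hat{t}_{x(j)}^{(2)} = \sum_{i \in S^{(2)}} w_{i(j)}^{(1)} w_{i(j)}^{(2)} x_i, \qquad \hat{t}_{y(j)}^{(2)} = \sum_{i \in S^{(2)}} w_{i(j)}^{(1)} w_{i(j)}^{(2)} y_i.$$
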
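For two-phase sampling with stratification, suppose that an SRS of size $n^{(1)}$ is taken 
in phase I and a stratified random sample is taken in phase II, where the sampling 
fractions for the strata are specified before the phase I sample is collected. Consider 
the estimator $\hat{t}_{str}^{(2)}$ in (12.4). Define the jackknife replicate, deleting unit j, by 

$$\hat{t}_{str(j)}^{(2)} = \sum_{i \in S^{(2)}} w_{i(j)}^{(1)} w_{i(j)}^{(2)} y_i,$$

where 

$$w_{i(j)}^{(1)} = \begin{cases} 0 & \text{if } j = i \\ \dfrac{N}{n^{(1)} - 1} & \text{if } j \ne i \end{cases} \qquad \text{and} \qquad w_{i(j)}^{(2)} = \begin{cases} \dfrac{n_h}{m_h} & \text{if } x_{ih} = 1, \ x_{jh} \ne 1 \\ \dfrac{n_h - 1}{m_h - 1} & \text{if } x_{ih} = x_{jh} = 1 \text{ and } j \in S^{(2)} \\ \dfrac{n_h - 1}{m_h} & \text{if } x_{ih} = x_{jh} = 1 \text{ and } j \notin S^{(2)}. \end{cases}$$
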
See Kim et al. (2006) for jackknife variance estimators for other designs and 
estimators. As always, the jackknife estimates the with-replacement variance; in two-phase 
sampling, both phases are assumed to be sampled with replacement. 


12.5 Designing a Two-Phase Sample 


Two-phase sample designs require all the considerations of one-phase samples, plus 
the additional decision of how many resources to devote to each phase. A two-phase 
sample is more complicated than a one-phase sample; before you use one, study the 
relative costs and make sure a two-phase sample really will be more efficient. The 
two-phase design for the veterans survey in Examples 12.1 and 12.4 was actually 
less efficient than a one-phase design would have been. A two-phase sample uses 
resources to measure x on units that are not subsampled in phase II, resources that 
could alternatively be used to measure y on additional units. If x and y are strongly 
related, then the information in the phase I sample improves the efficiency of data 
collection. But if x and y are not related—for example, if x is the last digit of a student’s 
telephone number and y is the student’s grade point average—then the resources used 
to measure x are essentially wasted; you would be better off if you just sampled y 
directly. 

Deming (1977) discusses issues to be considered when deciding whether to use a 
two-phase sample. A two-phase design adds complexity for both administering and 
analyzing the survey. It also can increase respondent burden, since in many cases 
respondents need to be contacted twice. If a two-phase design is used to identify 
persons with a certain characteristic such as diabetes, persons ultimately selected for 
the phase II sample may first be asked to answer a questionnaire for phase I and then 
be asked to participate in a medical examination for phase II. However, the two-phase 
sample has the advantage that it can give useful information about the screening 
method used in phase I. 


12.5.1 Two-Phase Sampling with Stratification 


Consider the situation in which phase I is an SRS and phase II is a stratified random 
sample. Efficiency gains for two-phase sampling arise when more observations are 
subsampled in strata with large variance, large values of $N_h$, or low cost. Rao (1973) 
proposes letting $m_h = \nu_h n_h$ for stratum h, with $\nu_h$, $h = 1, \ldots, H$, being constants to 
be determined before sampling. Let $c^{(1)}$ be the cost to sample a unit in the SRS taken 
for phase I and to determine its stratum membership. Let $c_h$ be the cost of measuring 
y for a unit in stratum h in phase II. Assume the total cost will be a linear function, 
with 

$$C = c^{(1)} n^{(1)} + \sum_{h=1}^{H} c_h m_h. \qquad (12.15)$$


The total cost C varies from sample to sample, since the $m_h$’s are only determined 
after the phase I sample is taken. The expected cost, however, is 

$$E[C] = c^{(1)} n^{(1)} + n^{(1)} \sum_{h=1}^{H} c_h \nu_h W_h, \qquad (12.16)$$


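where $W_h = N_h/N$. With $\nu_h$ fixed, we can write $V(\hat{\bar{y}}_{\text{str}}^{(2)})$ from (12.6) as:

$$V\left(\hat{\bar{y}}_{\text{str}}^{(2)}\right) = S_y^2 \left( \frac{1}{n^{(1)}} - \frac{1}{N} \right) + \frac{1}{n^{(1)}} \sum_{h=1}^{H} W_h S_h^2 \left( \frac{1}{\nu_h} - 1 \right).$$

Then $V(\hat{\bar{y}}_{\text{str}}^{(2)})$ is minimized, subject to the constraint in (12.16), when

$$\nu_{h,\text{opt}} = \sqrt{ \frac{c^{(1)} S_h^2}{c_h \left( S_y^2 - \sum_{j=1}^{H} W_j S_j^2 \right)} } \tag{12.17}$$

(see Exercise 17). If $\nu_{h,\text{opt}} > 1$ for a stratum $h$, then set $\nu_{h,\text{opt}} = 1$ and recalculate the
other values. With a predetermined expected cost $C^*$, the phase I sample should have
size

$$n^{(1)}_{\text{opt}} = \frac{C^*}{c^{(1)} + \sum_{h=1}^{H} c_h W_h \nu_{h,\text{opt}}}.$$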
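If $0 < \nu_{h,\text{opt}} < 1$ for $h = 1, \ldots, H$ and the optimal allocation is used, then, as shown
in Exercise 18, the two-phase sample has variance

$$V_{\text{opt}}\left(\hat{\bar{y}}_{\text{str}}^{(2)}\right) = \frac{1}{C^*} \left[ \sum_{h=1}^{H} W_h S_h \sqrt{c_h} + \sqrt{ c^{(1)} \left( S_y^2 - \sum_{h=1}^{H} W_h S_h^2 \right) } \right]^2 - \frac{S_y^2}{N}. \tag{12.18}$$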


Finding the optimal sample sizes for two-phase sampling with stratification requires
estimates of the within-stratum variances $S_h^2$, similarly to the optimal allocation in
stratified sampling discussed in Section 3.4.2. In addition, since the stratum sizes are
unknown, the values of $W_h$ and $S_y^2$ must also be estimated or guessed.
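
The allocation in (12.17) and the phase I sample size can be computed with a few lines of code. The following Python sketch is illustrative only: the function names, the stratum weights, variances, costs, and budget are hypothetical values, not data from the text. The step that handles $\nu_{h,\text{opt}} > 1$ is one reading of "set $\nu_{h,\text{opt}} = 1$ and recalculate the other values": it assumes that the cost and variance contributions of fully subsampled strata are moved into the fixed terms before (12.17) is reapplied, and that every $S_h^2$ is positive.

```python
import math

def optimal_fractions(c1, c_h, W_h, S2_h, S2_y):
    """Subsampling fractions nu_h from (12.17), capping at 1 and recomputing the rest."""
    H = len(c_h)
    capped = set()
    while True:
        # Assumed recalculation rule: strata with nu_h = 1 are fully subsampled,
        # so their cost and variance contributions join the fixed terms.
        fixed_cost = c1 + sum(c_h[k] * W_h[k] for k in capped)
        fixed_var = S2_y - sum(W_h[h] * S2_h[h] for h in range(H) if h not in capped)
        if fixed_var <= 0:
            raise ValueError("need S2_y larger than the weighted within-stratum variance")
        nu = [1.0 if h in capped
              else math.sqrt(fixed_cost * S2_h[h] / (c_h[h] * fixed_var))
              for h in range(H)]
        over = {h for h in range(H) if nu[h] > 1.0}
        if not over:
            return nu
        capped |= over

def phase1_size(budget, c1, c_h, W_h, nu):
    """Phase I sample size for a given expected cost, from (12.16)."""
    return budget / (c1 + sum(c * W * v for c, W, v in zip(c_h, W_h, nu)))

def two_phase_variance(n1, N, W_h, S2_h, S2_y, nu):
    """Variance of the two-phase stratified mean for fixed fractions nu_h."""
    extra = sum(W * S2 * (1.0 / v - 1.0) for W, S2, v in zip(W_h, S2_h, nu))
    return S2_y * (1.0 / n1 - 1.0 / N) + extra / n1

# Hypothetical planning values (guesses a designer might supply).
W = [0.5, 0.3, 0.2]
S2 = [4.0, 9.0, 16.0]
S2_total = 20.0
costs = [20.0, 20.0, 20.0]
nu_opt = optimal_fractions(1.0, costs, W, S2, S2_total)
n1_opt = phase1_size(3000.0, 1.0, costs, W, nu_opt)
v_opt = two_phase_variance(n1_opt, 100000, W, S2, S2_total, nu_opt)
print(nu_opt, n1_opt, v_opt)
```

When all fractions fall strictly between 0 and 1, the variance returned by `two_phase_variance` agrees with the closed form in (12.18), which gives a convenient check on planning calculations.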

We can compare the variance achieved by a two-phase stratified sample with
optimal allocation with the variance that we would get if we measured $y$ on a one-
phase sample with the same cost. For simplicity, assume $c_h = c^{(2)}$ for $h = 1, \ldots, H$.
This is a reasonable cost structure for two-phase studies used to estimate disease
prevalence, for example, in which all persons sampled in phase II are given the same
medical examination. If, instead of taking a two-phase sample, we took an SRS in one
phase with the same cost $C^*$, we could sample $n' = C^*/c^{(2)}$ units. Then, if $S_y^2/N$ is
negligible and the finite population correction $(1 - n'/N)$ is close to 1, the ratio of the
two-phase variance with optimal allocation to the one-phase variance with the same
expected cost is approximately

$$\frac{V_{\text{opt}}\left(\hat{\bar{y}}_{\text{str}}^{(2)}\right)}{V_{\text{SRS}}(\bar{y})} \approx \left[ \sum_{h=1}^{H} \frac{W_h S_h}{S_y} + \sqrt{ \frac{c^{(1)}}{c^{(2)}} \left( 1 - \sum_{h=1}^{H} \frac{W_h S_h^2}{S_y^2} \right) } \right]^2. \tag{12.19}$$


Thus, a two-phase sample with stratification is more efficient than an SRS for esti-
mating $\bar{y}_U$ if the within-strata variances $S_h^2$ are small relative to $S_y^2$ and if the cost to
sample phase I units is smaller than the cost to sample phase II units.
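
To see how the two conditions interact, the ratio in (12.19) can be evaluated for a few cost ratios. This Python sketch is only an illustration; the function name and the two-stratum population values are hypothetical.

```python
import math

def variance_ratio(W_h, S2_h, S2_y, cost_ratio):
    """Approximate V_opt(two-phase) / V_SRS from (12.19); cost_ratio = c1 / c2."""
    within = sum(W * math.sqrt(S2) for W, S2 in zip(W_h, S2_h)) / math.sqrt(S2_y)
    between_share = 1.0 - sum(W * S2 for W, S2 in zip(W_h, S2_h)) / S2_y
    return (within + math.sqrt(cost_ratio * between_share)) ** 2

# Hypothetical population: two equal strata, most variability between strata.
W = [0.5, 0.5]
S2 = [1.0, 1.0]
S2_total = 5.0
for ratio in (0.0001, 0.01, 0.1, 0.5, 1.0):
    print(ratio, round(variance_ratio(W, S2, S2_total, ratio), 3))
```

Values below 1 favor the two-phase design. In this hypothetical population the ratio is well below 1 when phase I is cheap and exceeds 1 when a phase I unit costs as much as a phase II unit.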

For two-phase sampling with stratification, $n_h$, the number of units in the phase
I sample that are in stratum $h$, is a random variable. If we select a different phase I
sample from the population, it is likely that we will get a different value for $n_h$. It is
possible that some phase I samples will have $n_h = 0$ for one or more strata. In that
case, we cannot subsample that stratum in phase II, and $y$ will not be measured on
any unit in stratum $h$. The estimator in (12.4) is unbiased for the population total only
if we assume $n_h > 0$ for all strata. Similarly, we need a subsample size of at least 2
in each stratum to estimate the variance within the stratum. We thus want to design a
two-phase sample so that $P(n_h = 0)$ is extremely small. If an SRS is taken at phase I
and $n^{(1)}/N$ is small, then $P(n_h = 0) \approx (1 - N_h/N)^{n^{(1)}}$, so $P(n_h = 0)$ is small if $n^{(1)}$ is
large or $N_h/N$ is not close to 0. Strata should thus be formed so that all strata are large
enough to have very high probability of being represented in the phase I sample.
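
The approximation $P(n_h = 0) \approx (1 - N_h/N)^{n^{(1)}}$ is easy to tabulate when planning the strata. The Python sketch below is illustrative; the function names and the stratum share of 1% are hypothetical.

```python
import math

def prob_empty_stratum(W_h, n1):
    """Approximate P(n_h = 0) for an SRS of size n1 when n1/N is small."""
    return (1.0 - W_h) ** n1

def min_phase1_size(W_h, target):
    """Smallest n1 for which the approximate P(n_h = 0) is at most `target`."""
    return math.ceil(math.log(target) / math.log(1.0 - W_h))

# Hypothetical rare stratum holding 1% of the population.
print(prob_empty_stratum(0.01, 100))
print(min_phase1_size(0.01, 0.001))
```

For a stratum this small, a phase I sample of 100 units would miss the stratum entirely in roughly a third of samples, which illustrates why very small strata are risky.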




12.5.2 Optimal Allocation for Ratio Estimation


How should the sample be allocated to phase I and phase II if ratio estimation is to
be used? Suppose an SRS is taken in each of phase I and phase II and that the total
cost of the two-phase sample is $C = c^{(1)} n^{(1)} + c^{(2)} n^{(2)}$. In Exercise 19, you will show
that the variance of $\hat{t}_{yr}^{(2)}$, given in (12.10), is minimized subject to a fixed cost $C$ when

$$\frac{n^{(2)}}{n^{(1)}} = \sqrt{ \frac{c^{(1)} S_d^2}{c^{(2)} \left( S_y^2 - S_d^2 \right)} }, \tag{12.20}$$

where $S_d^2$ is the population variance of the residuals $d_i = y_i - B x_i$. Consequently, for
a fixed cost $C$, the optimal phase I sample size is

$$n^{(1)} = \frac{C}{c^{(1)} + c^{(2)} \sqrt{ \dfrac{c^{(1)} S_d^2}{c^{(2)} \left( S_y^2 - S_d^2 \right)} }}$$

and the optimal phase II sample size is

$$n^{(2)} = \frac{C - c^{(1)} n^{(1)}}{c^{(2)}}.$$
The optimal sample sizes can often be estimated using results from a preliminary
survey or prior work. If $x$ and $y$ are highly correlated, we expect $S_d^2$ to be small
relative to $S_y^2$. In that situation, it makes sense to measure $x$ on a large phase I sample,
particularly if the cost of measuring $x$ is small, and use a relatively small phase II
sample to estimate $B$.
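
The allocation in (12.20) can be sketched in Python as follows. The function name, costs, variances, and budget are hypothetical planning values chosen for illustration. The sketch assumes $0 < S_d^2 < S_y^2$; a ratio above 1 would mean that no subsampling is called for.

```python
import math

def ratio_allocation(c1, c2, S2_y, S2_d, budget):
    """Phase II / phase I ratio from (12.20) and the resulting sample sizes."""
    if not 0 < S2_d < S2_y:
        raise ValueError("need 0 < S2_d < S2_y")
    ratio = math.sqrt(c1 * S2_d / (c2 * (S2_y - S2_d)))
    n1 = budget / (c1 + c2 * ratio)   # spend the whole budget at the optimal ratio
    n2 = (budget - c1 * n1) / c2
    return ratio, n1, n2

# Hypothetical values: y costs 25 times as much to measure as x,
# and the residual variance is one fifth of the variance of y.
ratio, n1, n2 = ratio_allocation(c1=1.0, c2=25.0, S2_y=100.0, S2_d=20.0, budget=5000.0)
print(ratio, n1, n2)
```

In practice the sizes would be rounded to integers, and $S_y^2$ and $S_d^2$ would be replaced by estimates from a pilot study or earlier survey.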


12.6 Chapter Summary


Two-phase sampling can increase precision of estimators of $t_y$ for a fixed budget if
there exist auxiliary variables x such that (1) the cost of measuring x is low compared 
with the cost of measuring y and (2) the auxiliary variables are correlated with y. 
The auxiliary information collected in the phase I sample can be used to improve 
the efficiency of the phase II sampling design, as when the auxiliary information x 
collected at phase I is used to stratify the phase II sample. Alternatively, or addition- 
ally, the information in the phase I sample can be used through ratio or regression 
estimation. 


Key Terms 


Phase I sample: A sample selected from a population on which auxiliary variables 
x are measured. 


Phase II sample: A subsample selected from the phase I sample on which the variable 
of interest y is measured. 
Two-phase sampling: A sampling design in which a preliminary (phase I) sample 
is selected from the population, and then a subsample (phase II sample) is selected 
from the phase I sample. 
For Further Reading


Watson (1937) presents an early example of a two-phase sample for regression.
Neyman (1938) developed theory for two-phase sampling with stratification. See
Cochran (1977) for more discussion on two-phase sampling with simple random
samples; Särndal et al. (1992, Chapter 9) and Legg and Fuller (2009) give a the-
oretical development for general probability sampling designs. Hidiroglou et al.
(2009) develop Sen-Yates-Grundy-type variance estimators for two-phase samples.
Rao and Sitter (1995) and Kim et al. (2006) present jackknife variance estimators
for two-phase sampling. Armstrong et al. (1993) give an example of two-phase sam-
pling for tax records. Hidiroglou (2001) discusses non-nested two-phase sampling
designs.


Exercises


A. Introductory Exercises


1 A health official takes a two-phase sample to estimate the prevalence of diabetes in
a population. In phase I, an SRS of size 1000 is taken from the population of size
100,000, and each individual is asked demographic information and whether he or
she has diabetes. It is known that some demographic groups are more at risk for dia-
betes than others; in addition, the self report of diabetes may be inaccurate. Therefore,
each individual in the phase II sample is given a medical exam to determine diabetes
status. The phase I sample is divided into 4 strata. Stratum $h$ has $n_h$ observations
in the phase I sample and $m_h$ observations in the phase II sample. The last column
of the table gives the number of persons in stratum $h$ of the phase II sample who
were determined by the medical exam to have diabetes.


Stratum                                         $n_h$   $m_h$   Number with diabetes
High risk group and reports diabetes             241     96     86
High risk group and does not report diabetes     113     45     17
Low risk group and reports diabetes              174     35     29
Low risk group and does not report diabetes      472     47      8


Estimate the total number of persons with diabetes in the population, along with its 
standard error. 


2 Data mining methods in statistics are used to discover relationships among variables
in very large data sets (Hastie et al., 2001). A company, for example, has databases of 
all its financial transactions. A very small fraction of these transactions involve fraud, 
but fraudulent transactions are expensive for the company. Discovering whether a 
transaction is fraudulent requires an investigation, so the company can only deter- 
mine whether transactions are fraudulent for a small sample. Discuss how two-phase 
sampling might be used to improve prediction of fraudulent transactions. (Chen et al., 
2002, discuss using two-phase sampling in data mining, but with purposive sampling 
at phase II rather than probability sampling.) 
B. Working with Survey Data


3 Bart and Earnst (2002) describe the use of two-phase sampling with ratio estimation
to estimate the density of nesting birds. The phase I sample, selected from the 2130
plots in the region of interest, is conducted using a rapid search method involving bird
sightings to obtain an approximate count of birds in each phase I plot. Then, a subsam-
ple of 12 of the phase I sample plots is surveyed using an intensive method to obtain
a more accurate count of the number of nests in each plot. In the intensive method, a
surveyor visits a plot for several hours over a period of days and searches for nests and
other indications of territorial males in the plot. In this setting, $x_i$ = number of nests
counted in plot $i$ using the rapid method, and $y_i$ = number of nests counted in plot $i$
using the intensive method. Using the data in file shorebirds.dat, which were gener-
ated using summary statistics from Bart and Earnst (2002), estimate the total number
of nests using the two-phase ratio estimator. Give the standard error of your estimate.


4 Dunn et al. (1999) discuss issues in analyzing two-phase data to estimate prevalence of
psychiatric disorders. Participants in a phase I sample were given the General Health
Questionnaire (GHQ) and classified into three strata based on their GHQ score. The
stratification was used to take a stratified random sample of 250 persons for the phase
II sample; the Composite International Diagnostic Interview (CIDI), considered to
be a more accurate diagnostic tool, was administered to each person in the phase II
sample. The CIDI score was used to classify the phase II sample members as having
at least one psychiatric disorder (case) or having no psychiatric disorder (non-case).
The results are given in the following table. The counts of cases and non-cases are
from the phase II sample.


Stratum                  $n_h$   $m_h$   Non-case   Case
GHQ ≤ 3 (low)            1049     60       33        27
GHQ = 4, 5 (medium)       237     48       14        34
GHQ ≥ 6 (high)            272    142       23       119
Total                    1558    250


a Calculate the phase II sampling weight $w_i^{(2)}$ for each stratum.


b Use the two-phase sample to estimate the percentage of persons with at least
one psychiatric disorder, along with its standard error. Since we do not know the
population size $N$, use a relative phase I weight of $w_i^{(1)} = 1$.


5 Dunn et al. (1999) also classified the phase II sample by gender. In the following table,
the entries in columns 4-7 are the counts from the phase II sample in the categories:
Male Non-Case (MNC), Male Case (MC), Female Non-Case (FNC), and Female
Case (FC).


Stratum                  $n_h$   $m_h$   MNC   MC   FNC   FC
GHQ ≤ 3 (low)            1049     60     16     8    17    19
GHQ = 4, 5 (medium)       237     48      9     8     5    26
GHQ ≥ 6 (high)            272    142     15    28     8    91


a Estimate the percentages of persons in each cell of a 2 × 2 contingency table classi-
fied by gender and case/non-case. Find the standard error of each entry in the table.


b Find the design effect for each cell proportion $p_{ij}$ and marginal proportion ($p_{i+}$
and $p_{+j}$) in the table.


c Use the Rao-Scott method (Section 10.3.2) to test $H_0: p_{ij} = p_{i+} p_{+j}$.


6 Ismail et al. (2002) report results of a two-phase survey to estimate prevalence of psy-
chiatric disorders in Gulf War veterans. A random sample of the $N$ = 53,462 Gulf War
veterans was administered the SF-36 questionnaire, a 36-question survey on health.
Respondents who scored below 72.2 on the physical functioning subscale were defined as
disabled (stratum 1, with $n_1 = 406$); respondents who scored above 72.2 were defined
as not disabled (stratum 2, with $n_2 = 3047$). A random subsample of 111 veterans
was taken from stratum 1, and a random subsample of 98 veterans was taken from
stratum 2. The 209 veterans in the phase II sample were evaluated by psychiatrists;
the counts in the table below give the number of veterans who were determined to have
any alcohol related disorder (Alcohol), any sleep disorder (Sleep), or any psychiatric
related disorder (Psych). A veteran can be in more than one of these categories.


Stratum          $n_h$   $m_h$   Alcohol   Sleep   Psych
Disabled          406    111       8        20      27
Not disabled     3047     98      10        17      12


Estimate the total number of Gulf War veterans with an alcohol related disorder, and 
give a 95% CI. 


7 Repeat Exercise 6 for sleep disorders.

8 Repeat Exercise 6 for any psychiatric disorder.


9 Use the results of Section 12.5 to determine an optimal allocation for a follow-up sur-
vey similar to that in Example 12.1. Assume that the relative costs are $c^{(1)} = 1$ and $c_h = 20$
for $h = 1, 2, 3$. Use the data in Example 12.1 to estimate quantities such as $W_h$ and
$S_h^2$. How does your allocation differ from the one used? From proportional allocation?


C. Working with Theory 


10 (Requires probability.) Suppose the phase I sample is an SRS of size $n^{(1)}$, and the
phase II subsample is an SRS of size $n^{(2)}$, with $n^{(2)} < n^{(1)}$. Show that

$$V\left(\hat{t}_y^{(2)}\right) = N^2 \left( 1 - \frac{n^{(2)}}{N} \right) \frac{S_y^2}{n^{(2)}},$$

the same variance that would result if an SRS of size $n^{(2)}$ were taken directly.


11 Estimating the variance in two-phase sampling for stratification. Show that (12.7)
is an approximately unbiased estimator of $V(\hat{t}_{\text{str}}^{(2)})$ in large samples. HINT: Use the
result derived from Table 3.3 in Chapter 3 that

$$S_y^2 = \left[ \sum_{h=1}^{H} (N_h - 1) S_h^2 + \sum_{h=1}^{H} N_h \left( \bar{y}_{hU} - \bar{y}_U \right)^2 \right] \Big/ (N - 1).$$


12 (Requires probability.) For two-phase sampling with ratio estimation (Section 12.3.1),
suppose the phase I sample is an SRS of size $n^{(1)}$, and the phase II sample is an SRS
of fixed size $n^{(2)}$.


a Show that $P(Z_i = 1) = n^{(1)}/N$, and $P(D_i = 1 \mid \mathbf{Z}) = Z_i n^{(2)}/n^{(1)}$.

b Show that (12.10) gives the approximate variance of $\hat{t}_{yr}^{(2)}$.

c Let $e_i = y_i - \hat{B} x_i$ and let $s_y^2$ and $s_e^2$ be the sample variances of the $y_i$'s and the
$e_i$'s from the phase II sample,

$$s_y^2 = \frac{1}{n^{(2)} - 1} \sum_{i \in S^{(2)}} \left( y_i - \bar{y}^{(2)} \right)^2 \quad \text{and} \quad s_e^2 = \frac{1}{n^{(2)} - 1} \sum_{i \in S^{(2)}} e_i^2.$$

Show that (12.11) is an approximately unbiased estimator of $V(\hat{t}_{yr}^{(2)})$.


13 Rao and Sitter (1995) propose an alternative linearization variance estimator for the
situation in Exercise 12.

a Using part (b) of Exercise 12, show that

where $S_{xd} = \sum_{i=1}^{N} (x_i - \bar{x}_U) d_i/(N - 1)$. HINT: Write

$$S_y^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - B x_i + B x_i - B \bar{x}_U)^2.$$

b Show that

is an approximately unbiased estimator of $V(\hat{t}_{yr}^{(2)})$, where
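14 Demnati-Rao (2004) linearization variance estimator in two-phase sampling. The
linearization variance estimator presented in Exercise 23 of Chapter 9 can be extended
to two-phase sampling. Let $\theta$ be the population quantity of interest, and define the
estimator $\hat{\theta}$ to be a function of the vectors of sampling weights for the phase I and
phase II samples and the population values:

$$\hat{\theta} = g(\mathbf{w}^{(1)}, \mathbf{w}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m, \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k),$$

where $\mathbf{w}^{(1)} = (w_1^{(1)}, \ldots, w_N^{(1)})^T$ with $w_i^{(1)}$ the phase I sampling weight of unit $i$ ($w_i^{(1)} = 0$
if $i$ is not in the phase I sample), $\mathbf{w} = (w_1, \ldots, w_N)^T$ with $w_i$ the final sampling weight
of unit $i$ in the phase II sample ($w_i = w_i^{(1)} w_i^{(2)}$ if $i \in S^{(2)}$ and $w_i = 0$ if $i \notin S^{(2)}$), $\mathbf{x}_j$
is the vector of population values for the $j$th auxiliary variable (measured in phase I),
and $\mathbf{y}_j$ is the vector of population values for the $j$th response variable (measured in
phase II). Now let

$$z_i^{(1)} = \frac{\partial g(\mathbf{w}^{(1)}, \mathbf{w}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m, \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k)}{\partial w_i^{(1)}}$$

and

$$z_i^{(2)} = \frac{\partial g(\mathbf{w}^{(1)}, \mathbf{w}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m, \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k)}{\partial w_i}.$$

Then,

$$\hat{V}_{DR}(\hat{\theta}) = \hat{V}\left( \sum_{i \in S^{(1)}} w_i^{(1)} z_i^{(1)} + \sum_{i \in S^{(2)}} w_i z_i^{(2)} \right).$$


a Consider the two-phase ratio estimator in (12.9). We can write

$$\hat{t}_{yr}^{(2)} = \sum_{i \in S^{(1)}} w_i^{(1)} x_i \, \frac{\sum_{i \in S^{(2)}} w_i y_i}{\sum_{i \in S^{(2)}} w_i x_i},$$

where $w_i = w_i^{(1)} w_i^{(2)}$. Show that the Demnati-Rao linearization variance estima-
tor is

$$\hat{V}_{DR} = \hat{V}\left( \sum_{i \in S^{(1)}} w_i^{(1)} x_i \hat{B}^{(2)} + \frac{\hat{t}_x^{(1)}}{\hat{t}_x^{(2)}} \sum_{i \in S^{(2)}} w_i \left( y_i - \hat{B}^{(2)} x_i \right) \right).$$


b Suppose that the phase I sample is an SRS of size $n^{(1)}$ and the phase II sample is
an SRS of size $n^{(2)}$. What is $\hat{V}_{DR}(\hat{t}_{yr}^{(2)})$ for this case?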


15 Show that if the estimator in (12.12) is applied to any of the auxiliary variables in $\mathbf{x}$,
then $\hat{t}_{x\text{GREG}}^{(2)} = \hat{t}_x^{(1)}$.

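16 (Requires probability.) Suppose the phase I sample is an unequal-probability sample
of observations. If we observed $y_i$ for every unit in $S^{(1)}$, we could use the Horvitz-
Thompson estimator of the variance of the Horvitz-Thompson estimator in (6.22) to
estimate the first term in (12.14):

$$\hat{V}_{HT}\left(\hat{t}_y^{(1)}\right) = \sum_{i \in S^{(1)}} \sum_{k \in S^{(1)}} \frac{\pi_{ik}^{(1)} - \pi_i^{(1)} \pi_k^{(1)}}{\pi_{ik}^{(1)}} \, \frac{y_i}{\pi_i^{(1)}} \, \frac{y_k}{\pi_k^{(1)}},$$

where $\pi_i^{(1)} = P(Z_i = 1)$ and $\pi_{ik}^{(1)} = P(Z_i Z_k = 1)$ for $i \ne k$ and $\pi_{ii}^{(1)} = P(Z_i = 1)$. We
need an estimator of $V(\hat{t}_y^{(1)})$, however, that depends only on the $y$ values in the phase
II sample. Let $\pi_{ik}^{(2)} = P(D_i D_k = 1 \mid \mathbf{Z}) > 0$. Show that

$$\hat{V}_{HT}^{(2)}\left(\hat{t}_y^{(1)}\right) = \sum_{i \in S^{(2)}} \sum_{k \in S^{(2)}} \frac{\pi_{ik}^{(1)} - \pi_i^{(1)} \pi_k^{(1)}}{\pi_{ik}^{(1)} \pi_{ik}^{(2)}} \, \frac{y_i}{\pi_i^{(1)}} \, \frac{y_k}{\pi_k^{(1)}}$$

is an unbiased estimator of $V(\hat{t}_y^{(1)})$.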
17 (Requires calculus.) Optimal allocation for two-phase sampling with stratification.
Suppose phase I is an SRS and phase II is a stratified random sample, and that the
total cost for the sample is given in (12.15), where $c^{(1)}$ is the cost to sample a unit in
phase I and $c_h$ is the cost to sample a unit in stratum $h$ in phase II. Let $\nu_h = m_h/n_h$,
$h = 1, \ldots, H$, be the proportion of phase I units in stratum $h$ to be sampled in phase II.


a Show that the expected cost is (12.16).


b Show that $V(\hat{\bar{y}}_{\text{str}}^{(2)})$ is minimized, subject to the constraint in (12.16), when $\nu_h$ is
given in (12.17). Hint: Use Lagrange multipliers.


18 Show that when the optimal allocation is used, the variance of $\hat{\bar{y}}_{\text{str}}^{(2)}$ for two-phase
sampling with stratification is given by (12.18).


19 (Requires calculus.) Show that if an SRS of size $n^{(1)}$ is taken in phase I, and an SRS
of size $n^{(2)}$ is taken in phase II, then taking the ratio $n^{(2)}/n^{(1)}$ in (12.20) minimizes the
variance in (12.10) for a fixed cost $C$.


20 This exercise is based on results in McNamee (2003) on the use of two-phase
sampling to estimate disease prevalence. An inexpensive, but possibly inaccurate,
screening test for the disease is given in the phase I sample, an SRS of size $n^{(1)}$. Let
$x_i = 1$ if person $i$ tests positive on the screening test and $x_i = 0$ if person $i$ tests
negative on the screening test. Persons are then classified into stratum 1 ($x_i = 0$) and
stratum 2 ($x_i = 1$). The persons sampled in phase II are given a test for the presence
of the disease that, for purposes of this exercise, is assumed to be 100% accurate:
The phase II response is $y_i = 1$ if person $i$ has the disease and 0 otherwise. We can
write the population values in a contingency table:


                        Screening Test
                        Negative          Positive
Disease     No          $C_{11}$          $C_{12}$          $C_{1+}$
present?    Yes         $C_{21}$          $C_{22}$          $C_{2+}$
                        $C_{+1} = N_1$    $C_{+2} = N_2$    $N$


We wish to estimate $p = \bar{y}_U = C_{2+}/N$ from the two-phase sample; $p_1 = C_{21}/N_1$ and
$p_2 = C_{22}/N_2$ are the proportions with the disease in strata 1 and 2, respectively.


a Epidemiologists often use the concepts of specificity and sensitivity to assess a
test for a disease, with

$$S_1 = \text{Specificity} = P(\text{test is negative} \mid \text{disease absent}) = \frac{C_{11}}{C_{1+}}$$

and

$$S_2 = \text{Sensitivity} = P(\text{test is positive} \mid \text{disease present}) = \frac{C_{22}}{C_{2+}}.$$

Show that

$$\frac{N_1}{N} p_1 = (1 - S_2) p, \qquad \frac{N_1}{N} (1 - p_1) = (1 - p) S_1,$$

$$\frac{N_2}{N} p_2 = p S_2, \qquad \frac{N_2}{N} (1 - p_2) = (1 - p)(1 - S_1).$$

b Suppose that the optimal allocation is used (see Section 12.5.1) and that $0 <
\nu_{h,\text{opt}} < 1$ for $h = 1, 2$. Using (12.19) and part (a), show that

$$\frac{V_{\text{opt}}\left(\hat{p}_{\text{str}}^{(2)}\right)}{V_{\text{SRS}}(\hat{p})} \approx \left[ \sqrt{(1 - S_2) S_1} + \sqrt{S_2 (1 - S_1)} + R \sqrt{\frac{c^{(1)}}{c^{(2)}}} \right]^2,$$

where $R$ is the population Pearson correlation coefficient between $x$ and $y$, given
in (4.1). Hint: For the second term, first show that $R S_y = p(S_2 - W_2)/\sqrt{W_1 W_2}$.

c Calculate the ratio of variances in (b) when $S_1 = S_2$ and $R = \min\{S_1 + S_2 -
0.9, 0.95\}$, for $S_1 \in \{0.5, 0.6, 0.7, 0.8, 0.9, 0.95\}$ and $c^{(1)}/c^{(2)} \in \{0.0001, 0.01, 0.1,
0.5, 1\}$. Display your results in a table. For which settings would you recommend
two-phase sampling to estimate disease prevalence?


21 Inverse Sampling. Hinkins et al. (1997) (also see Rao et al., 2003) note that in some
situations one might want to apply a statistical procedure developed for an SRS to data
from a complex survey, but the stratification and clustering in the survey make direct
application inappropriate. They propose an inverse sampling algorithm to create a
subsample from the complex survey that is an SRS from the population, essentially
by inverting the procedure used to draw the complex sample. The procedure can be
repeated multiple times.

Suppose that the complex survey is a stratified random sample. The population
stratum sizes are $N_1, \ldots, N_H$ and the sample sizes are $n_1, \ldots, n_H$. It would be possi-
ble for all the observations in an SRS from the population to be in one stratum, so
the maximum possible size of the subsample is $m = \min\{n_1, \ldots, n_H\}$. Use the hyper-
geometric distribution to generate subsampling sizes $m_1, \ldots, m_H$ from the strata, with
$\sum_{h=1}^{H} m_h = m$, where

$$P(M_1 = m_1, M_2 = m_2, \ldots, M_H = m_H) = \frac{1}{\binom{N}{m}} \binom{N_1}{m_1} \binom{N_2}{m_2} \cdots \binom{N_H}{m_H}.$$

In stratum $h$, select an SRS of $m_h$ of the $n_h$ sampled units to be in $S^{(2)}$.


a Show that the probability that any subset of $m$ units in the population is selected
as the sample through this procedure (first taking a stratified random sample of
size $n$, then using inverse sampling to select a subsample of size $m$) is $\binom{N}{m}^{-1}$.


b Use inverse sampling to select a subsample of size 21 from the stratified random
sample in file agstrat.dat, first discussed in Example 3.2.
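
The two steps of the inverse sampling procedure can be sketched in Python using only the standard library. This is an illustrative sketch, not the authors' implementation: the function names, stratum sizes, and unit identifiers below are hypothetical and are not the values in agstrat.dat. The stratum subsample sizes are generated by drawing $m$ unit labels without replacement from $N$ labels and counting how many fall in each stratum, which yields the multivariate hypergeometric distribution given above.

```python
import bisect
import itertools
import random

def subsample_sizes(N_h, m, rng):
    """Draw (m_1, ..., m_H) by sampling m of the N unit labels without replacement."""
    bounds = list(itertools.accumulate(N_h))      # cumulative stratum boundaries
    picks = rng.sample(range(bounds[-1]), m)
    sizes = [0] * len(N_h)
    for unit in picks:
        sizes[bisect.bisect_right(bounds, unit)] += 1
    return sizes

def inverse_sample(stratum_samples, N_h, rng):
    """Select an SRS of m_h units from the sampled units of each stratum."""
    m = min(len(s) for s in stratum_samples)      # largest subsample size allowed
    sizes = subsample_sizes(N_h, m, rng)
    return [rng.sample(s, k) for s, k in zip(stratum_samples, sizes)]

# Hypothetical stratified sample: unit identifiers within four strata.
rng = random.Random(2024)
N_h = [200, 1000, 1400, 400]
n_h = [21, 100, 135, 40]
samples = [[f"s{h}u{i}" for i in range(n)] for h, n in enumerate(n_h)]
subsample = inverse_sample(samples, N_h, rng)
print([len(s) for s in subsample])
```

Repeating the call with fresh random draws gives the repeated inverse samples mentioned in the exercise.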


22 Ranked set sampling. McIntyre (1952) proposed ranked set sampling as a method of
improving precision of estimates by a method related to two-phase sampling. Stokes
(1980) and Patil (2002) recommend ranked set sampling for situations in which the
response of interest, $y$, may be difficult to measure but it is easy to estimate an approx-
imate ranking of the units in a sample, either visually or by using a correlated variable.
In McIntyre's application $y_i$ = pasture yield of field $i$. Measuring $y_i$ requires mowing
and weighing, a time-consuming process. An expert, however, can assess and rank
a small number of fields from lowest to highest yield by visual inspection, which is
much less effort.

To implement ranked set sampling, select $k$ independent SRSs, each of size $k$. Rank
each of the $k$ samples from low to high, using either judgment or a correlated easy-to-
measure variable. (This ranking must be done without knowing any values of $y_i$.) Then
select the smallest unit of the first sample for measurement of $y$, the second smallest
unit of the second sample, and so on until the largest unit of sample $k$ is selected. Repeat
this procedure until $m$ replicates are obtained. At the end of the process $mk^2$ units
have been ranked and $y$ has been measured on a sample of $n = mk$ units. Let $\bar{y}_j$ be the
mean of the $y$-values measured in replicate $j$, $j = 1, \ldots, m$, and let $\bar{y}_{RSS} = \frac{1}{m} \sum_{j=1}^{m} \bar{y}_j$.

We illustrate the method with a small example using the data in agpop.dat (Husby
et al., 2005, have a similar example using NHANES data). We want to estimate the
population total for $y$ = acres92. We take $mk$ SRSs, each of size $k$, using $m = 10$ and
$k = 4$ (typically, $k$ is relatively small to allow an expert to rank the elements). We
rank each of the $mk$ SRSs using the correlated variable acres87. Here are the values of
acres87 for the 4 samples in the first replicate (SAS code used to obtain the samples
is on the website).


                     Observation
Sample 1:  x         119,956    144,986    108,861    302,659
           Rank            2          3          1          4
Sample 2:  x         351,106    294,551     80,104    226,954
           Rank            4          3          1          2
Sample 3:  x         241,276    253,421    702,173    412,225
           Rank            1          2          4          3
Sample 4:  x         529,964    823,729    355,973    121,119
           Rank            3          4          2          1


We then measure $y$ on the third unit in Sample 1 (which has rank 1), the fourth unit in
Sample 2 (rank 2), the fourth unit in Sample 3 (rank 3), and the second unit in Sample
4 (rank 4), obtaining the $y$ values 106,206, 246,038, 379,044, and 783,715. Note that
since $y$ is highly correlated with $x$, these four $y$ values are forced to be spread out.
The same procedure is repeated for the remaining nine replicates. We obtain


$\bar{y}_1$        $\bar{y}_2$      $\bar{y}_3$        $\bar{y}_4$        $\bar{y}_5$
378,750.75   51,841     280,658.5    187,791.5    175,436.75
$\bar{y}_6$        $\bar{y}_7$      $\bar{y}_8$        $\bar{y}_9$        $\bar{y}_{10}$
446,398.5    1,092,582.5   499,146   350,570      665,457.75


Consequently, $\bar{y}_{RSS} = \frac{1}{10} \sum_{j=1}^{10} \bar{y}_j = 412,863.33$.
a How is ranked set sampling similar to two-phase sampling? How does it differ?


b Argue that if the ranking is based on an auxiliary variable $x$, and the value of $y$ itself
is not used in the ranking, then $\bar{y}_{RSS}$ is an unbiased estimator of the population
mean.

c Show that if the $m$ replicates are selected independently, then

$$\hat{V}(\bar{y}_{RSS}) = \frac{1}{m(m - 1)} \sum_{j=1}^{m} \left( \bar{y}_j - \bar{y}_{RSS} \right)^2$$

is an unbiased estimator of $V(\bar{y}_{RSS})$. (HINT: See Section 9.2.) Using this method,
we obtain $SE(\bar{y}_{RSS}) = 93,943.21$ for our sample.


d The properties of ranked set sampling require that the ranking in each of the $mk$
samples be done using the same method. What might go wrong in practice? How
might a ranker who knows the sampling procedure produce bias in the estimates?
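
The selection rule and the summary statistics quoted in this exercise can be checked with a short Python sketch. The function name `ranked_set_pick` is hypothetical; the $x$ values and replicate means are the ones listed in the exercise. Positions are counted from 0 in the code, so position 2 is the third unit of a sample.

```python
import math

def ranked_set_pick(x_samples):
    """For the r-th sample (r = 0, 1, ...), return the position of the unit
    whose x value has rank r + 1 within that sample."""
    picks = []
    for r, xs in enumerate(x_samples):
        order = sorted(range(len(xs)), key=lambda i: xs[i])  # positions by increasing x
        picks.append(order[r])
    return picks

# acres87 values for the four samples in the first replicate.
first_replicate_x = [
    [119956, 144986, 108861, 302659],
    [351106, 294551, 80104, 226954],
    [241276, 253421, 702173, 412225],
    [529964, 823729, 355973, 121119],
]
picks = ranked_set_pick(first_replicate_x)

# Replicate means of y reported for the ten replicates.
replicate_means = [378750.75, 51841, 280658.5, 187791.5, 175436.75,
                   446398.5, 1092582.5, 499146, 350570, 665457.75]
m = len(replicate_means)
y_rss = sum(replicate_means) / m
se_rss = math.sqrt(sum((yj - y_rss) ** 2 for yj in replicate_means) / (m * (m - 1)))
print(picks, y_rss, se_rss)
```

The picked positions correspond to the third, fourth, fourth, and second units, matching the units measured in the first replicate, and the mean and standard error agree with the values reported above up to rounding.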


D. Projects and Activities 


23 Forest data. Treat the data in forest.dat, described in Exercise 36 of Chapter 2, as the
population for this problem. Suppose you are interested in estimating the total number
of cells with cover type 1 (spruce/fir). Select an SRS of 5000 records. Determine the
elevation of each cell, and form two strata: (1) elevation less than 3000 m and (2) 
elevation greater than or equal to 3000 m. Now select a stratified random sample of 
500 of the records in your phase I sample, and use the subsample to estimate the total 
number of cells with spruce/fir. You may want to use a small pilot sample to estimate 
the optimal allocation of sample sizes in the two strata. 


24 Ranked set sampling using the forest data. Use the ranked set sampling procedure
described in Exercise 22 with the forest data. We expect the variables $x$ = Hillshade_9am
and $y$ = Hillshade_3pm to be negatively correlated.

a Draw an SRS of size 25 from the data. Find $\bar{y}$ and $\hat{V}(\bar{y})$ using formulas from
Chapter 2.

b Draw 100 SRSs of size 4 from the forest data, and arrange these as 25 sets of 4
SRSs. Rank the $x$ variables in each SRS, and choose one $y$ value from each SRS
using the ranked values. Find $\bar{y}_{RSS}$ and $\hat{V}(\bar{y}_{RSS})$ for your data.

c How do the two estimated variances compare? Does ranked set sampling improve
precision for this example?


Estimating Population Size 


I caught a large number of fishes in the neighbourhood of Suez. I passed a copper ring through their
tails, and threw them back into the sea. Some months later, on the coast of Syria, I caught some of my
fish ornamented with the ring.


—Jules Verne, Twenty Thousand Leagues Under the Sea 


13.1 Capture-Recapture Estimation


EXAMPLE 13.1 Suppose we want to estimate N, the number of fish in a lake. One method is as follows: 
Catch and mark 200 fish in the lake, then release them. Allow the marked and released 
fish to mix with the other fish in the lake. Then, take a second, independent sample of 
100 fish. Suppose that 20 of the fish in the second sample are marked. Then, assuming 
that the population of fish has not changed between the two samples and that each 
catch gives a simple random sample (SRS) of fish in the lake, estimate that 20% of 
the fish in the lake are marked, and, therefore, that the 200 fish tagged in the original 
sample represent approximately 20% of the population of fish. The population size 
N is then estimated to be approximately 1000.


This method for estimating the size of a population is called two-sample capture— 
recapture estimation. Other names sometimes used are tag- or mark-recapture, 
multiple record system, the Petersen (1896) method, or the Lincoln (1930) index. 
The method relies on the following assumptions: 


1 The population is closed—no fish enter or leave the lake between the samples. 
This means that N is the same for each sample. 

2 Each sample of fish is an SRS from the population. This means that each fish 
is equally likely to be chosen in a sample—it is not the case, for example, that 
smaller or less healthy fish are more likely to be caught. Also, there are no “hidden 
fish” in the population that are impossible to catch. 
3 The two samples are independent. The marked fish from the first sample become
re-mixed in the population, so that the marking status of a fish is unrelated to the
probability that the fish is selected in the second sample. Also, fish included in
the first sample do not become "trap-shy" or "trap-happy": the probability that
a fish will be caught in the second sample does not depend on its capture history.


4 Fish do not lose their markings, and marked fish can be identified as such. Water- 
soluble paint, for example, would not be a good choice for marking material. 


In this simple form, capture-recapture is a special case of ratio estimation of a
population total, and results from Chapter 4 may be used when the samples and popu-
lation are large. Let $n_1$ be the size of the first sample, $n_2$ the size of the second sample,
and $m$ the number of marked fish caught in the second sample. In Example 13.1,
$n_1 = 200$, $n_2 = 100$, $m = 20$, and we used the estimator $\hat{N} = n_1 n_2/m$. To see how
this estimator fits into the framework of Chapter 4, let

$$y_i = 1 \text{ for every fish in the lake}$$

and

$$x_i = \begin{cases} 1 & \text{if fish } i \text{ is marked} \\ 0 & \text{if fish } i \text{ is not marked.} \end{cases}$$

Then estimate $N = t_y = \sum_{i=1}^{N} y_i$ by $\hat{t}_{yr} = t_x \hat{B}$, where $t_x = \sum_{i=1}^{N} x_i = n_1$ and
$\hat{B} = \bar{y}/\bar{x} = n_2/m$. This ratio estimator,

$$\hat{N} = \frac{n_1 n_2}{m}, \tag{13.1}$$

is also the maximum likelihood estimator (see Exercises 13 and 14). Applying (4.10)
to the second SRS and ignoring the fpc, $s_e^2 = n_2(n_2 - m)/[m(n_2 - 1)]$ and

$$\hat{V}(\hat{N}) = t_x^2 \hat{V}(\hat{B}) = \left( \frac{n_1 n_2}{m} \right)^2 \frac{n_2 - m}{m(n_2 - 1)} \approx \frac{n_1^2 n_2 (n_2 - m)}{m^3}.$$

For the data in Example 13.1, $\hat{V}(\hat{N}) = 40,000$.
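
The calculation can be reproduced with a short Python sketch. The function name is hypothetical; the formulas are the ratio-estimation expressions given above, with the fpc ignored, and the sketch returns both the form that keeps the factor $n_2/(n_2 - 1)$ and the simpler approximation.

```python
def petersen_estimate(n1, n2, m):
    """Two-sample capture-recapture estimate of N with its estimated variance."""
    if m == 0:
        raise ValueError("no marked animals recaptured; the estimate is infinite")
    N_hat = n1 * n2 / m
    s2_e = n2 * (n2 - m) / (m * (n2 - 1))          # residual variance from (4.10)
    var_ratio = n1 ** 2 * n2 * s2_e / m ** 2       # t_x^2 * s_e^2 / (n2 * xbar^2)
    var_approx = n1 ** 2 * n2 * (n2 - m) / m ** 3  # treats n2 / (n2 - 1) as 1
    return N_hat, var_ratio, var_approx

N_hat, var_ratio, var_approx = petersen_estimate(200, 100, 20)
print(N_hat, var_ratio, var_approx)
```

For the fish data the approximation gives 40,000, the value used in the text, while the version retaining $n_2/(n_2 - 1)$ is about 1% larger.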

Being a ratio estimator, though, $\hat{N}$ is biased, and the bias can be large in wildlife
applications with small sample sizes. Indeed, it is possible for the second sample to
consist entirely of unmarked animals, making the estimate in (13.1) infinite. Chapman
(1951) proposes the less biased estimator

$$\tilde{N} = \frac{(n_1 + 1)(n_2 + 1)}{m + 1} - 1. \tag{13.2}$$

A variance estimator for $\tilde{N}$ (Seber, 1970) is

$$\hat{V}(\tilde{N}) = \frac{(n_1 + 1)(n_2 + 1)(n_1 - m)(n_2 - m)}{(m + 1)^2 (m + 2)}. \tag{13.3}$$

The estimators in (13.2) and (13.3) are often used in wildlife applications. For the fish
data, $\tilde{N} = (201)(101)/21 - 1 = 966$, and $\hat{V}(\tilde{N}) = 30,131$.

Many researchers have constructed confidence intervals (CIs) for the popula-
tion size using either $\hat{N} \pm 1.96 \sqrt{\hat{V}(\hat{N})}$ or $\tilde{N} \pm 1.96 \sqrt{\hat{V}(\tilde{N})}$. These are not entirely
satisfactory, however, because both require that $\hat{N}$ or $\tilde{N}$ be approximately normally
distributed, and the normal distribution may not be a good approximation to the
distribution of $\hat{N}$ or $\tilde{N}$ for small populations and samples. We'll discuss CIs in
Section 13.1.2; first, however, let's look at another approach for these data that will
be useful in developing CIs.
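
A brief Python sketch of (13.2) and (13.3) follows; the function name is hypothetical. Unlike the estimator in (13.1), the Chapman estimate stays finite when no marked animals are recaptured.

```python
def chapman_estimate(n1, n2, m):
    """Chapman estimator (13.2) and the Seber variance estimator (13.3)."""
    N_tilde = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
                / ((m + 1) ** 2 * (m + 2)))
    return N_tilde, variance

N_tilde, var_tilde = chapman_estimate(200, 100, 20)
print(round(N_tilde), round(var_tilde))
print(chapman_estimate(200, 100, 0)[0])   # finite even with no recaptures
```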


13.1.1 Contingency Tables for Capture-Recapture Experiments


Fienberg (1972) suggests viewing capture-recapture data in an incomplete contin-
gency table. For the data in Example 13.1, the table is as follows:


                        In Sample 2?
                        Yes      No
In Sample 1?   Yes       20     180     200
               No        80       ?       ?
                        100       ?       N


In general, if $x_{ij}$ is the observed count in cell $(i, j)$, the contingency table looks as
follows. An asterisk indicates that we do not observe that cell.


                        In Sample 2?
                        Yes                 No
In Sample 1?   Yes      $x_{11}$ $(= m)$      $x_{12}$      $x_{1+}$ $(= n_1)$
               No       $x_{21}$              $x_{22}^*$    $x_{2+}^*$
                        $x_{+1}$ $(= n_2)$    $x_{+2}^*$    $x_{++}^*$

The expected counts are:

                        In Sample 2?
                        Yes           No
In Sample 1?   Yes      $m_{11}$      $m_{12}$      $m_{1+}$
               No       $m_{21}$      $m_{22}^*$    $m_{2+}^*$
                        $m_{+1}$      $m_{+2}^*$    $m_{++} = N$


To estimate the expected counts, we use m, = X11, M12 = X\2, and Mm, = x1. If 
presence in sample | is independent of presence in sample 2, then the odds of being 
in sample 2 are the same for marked fish as for unmarked fish: 11; /mj2 = m1 /m22. 
Consequently, under independence, the estimated count in the cell of fish not included 
in either sample is 

m2) X12%21 


m2 = —; =< , 
m\\ X11 
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and 
Ny ~ ~ ~ ~ X4 1X14 nin 
N=m, + M2 +m + m2 = —— = —. 
X11 m 
The estimator WN is calculated based on the assumption that the two samples are 
independent; unfortunately, this assumption cannot be tested because only three of 
the four cells of the contingency table are observed. 
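As a quick check of the algebra, the missing cell and the population estimate can be computed from the three observed cells. This Python sketch is illustrative only, and the function name `complete_table` is hypothetical.

```python
def complete_table(x11, x12, x21):
    """Estimate the unobserved cell and N, assuming the two samples are independent."""
    m22_hat = x12 * x21 / x11            # from m11 / m12 = m21 / m22
    n_hat = x11 + x12 + x21 + m22_hat    # algebraically equal to n1 * n2 / m
    return m22_hat, n_hat


# Observed cells for Example 13.1: 20 in both samples, 180 in sample 1 only, 80 in sample 2 only.
m22_hat, n_hat = complete_table(x11=20, x12=180, x21=80)
print(m22_hat, n_hat)  # 720.0 1000.0
```

The result agrees with $n_1 n_2 / m = (200)(100)/20 = 1000$.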


13.1.2 Confidence Intervals for N
In many applications of capture-recapture, CIs have been constructed using
$\hat{N} \pm 1.96\sqrt{\hat{V}(\hat{N})}$ or $\tilde{N} \pm 1.96\sqrt{\hat{V}(\tilde{N})}$. If we use the first interval for the data in
Example 13.1, $\hat{V}(\hat{N}) = 40{,}000$, and an asymptotic 95% CI would be 1000 ± 1.96(200) =
[608, 1392]. The CI using the normal approximation and $\tilde{N}$ is [626, 1306]. Unfortunately,
CIs based on the assumption that $\hat{N}$ or $\tilde{N}$ follow a normal distribution often
have poor coverage probability in small samples because the distribution of $\hat{N}$ and
$\tilde{N}$ is actually quite skewed, as you will see in Exercise 17. In general, we do not
recommend using these CIs.

An additional shortcoming of CIs based on the normal distribution can occur in
small samples. For example, suppose that $n_1 = 30$, $n_2 = 20$, and $m = 15$. Then
$\hat{N} = (30)(20)/15 = 40$, and $\hat{V}(\hat{N}) = 26.7$. Using a normal approximation to the
distribution of $\hat{N}$ results in the CI [30, 50]. The lower bound of 30 is silly, however;
a total of 35 distinct animals were observed in the two samples, so we know that N
must be at least 35.

Cormack (1992) discusses using the Pearson or likelihood ratio chi-square test for 
independence to construct a CI. Using this method, we fill in the missing observation 
$x_{22}$ by some value u and perform a chi-square test for independence on the artificially
completed data set. The 95% CI for $m_{22}$ is then all values of u for which the null
hypothesis of independence for the two samples would not be rejected at the 0.05 
level. For the data in Example 13.1, let’s try the value u = 600. With this value, the 
“completed” contingency table is 


                        In Sample 2?
                        Yes     No
In Sample 1?    Yes      20    180    200
                No       80    600    680
                        100    780    880


We can easily perform Pearson’s chi-square test for independence on this table, obtain- 
ing a p-value of 0.49. As 0.49 > 0.05, the value 600 would be inside the 95% CI for u, 
and the value 880 would be inside the 95% CI for N. Setting u equal to 1500, though, 
gives p-value 0.0043, so 1500 is outside the 95% CI for u, and 1780 is thus outside 
the 95% CI for N. Continuing in this manner, we find that values of u between 430 
and 1198 are the only ones that result in p-value > 0.05, so [430, 1198] is a 95% CI
for $m_{22}$. The corresponding CI for N is obtained by adding the number of observed
animals in the other cells, 280, to the endpoints of the CI for $m_{22}$, resulting in the
interval [710, 1478].

The likelihood ratio test may be used in similar manner, by including in the CI 
all values of u for which the p-value from the likelihood ratio test exceeds 0.05. Using 
the computer code given on the website, we find that values of u between 437 and 
1233 give a likelihood ratio p-value exceeding 0.05. The CI for N, using the likeli- 
hood ratio test, is then [717, 1513]. 
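The test-inversion search described in the last two paragraphs can be sketched in a few lines of Python. This is not the program from the book's website; it is a minimal illustration that scans integer values of u, with hypothetical function names and an assumed search limit `u_max`. The p-value uses the fact that a chi-square variable with 1 df is the square of a standard normal variable.

```python
import math


def chi2_1df_pvalue(stat):
    """Right-tail area of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def completed_table_stats(x11, x12, x21, u):
    """Return (Pearson X^2, likelihood ratio G^2) for the table completed with u."""
    obs = [[x11, x12], [x21, u]]
    total = x11 + x12 + x21 + u
    rows = [x11 + x12, x21 + u]
    cols = [x11 + x21, x12 + u]
    pearson = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            pearson += (obs[i][j] - expected) ** 2 / expected
            if obs[i][j] > 0:
                g2 += 2.0 * obs[i][j] * math.log(obs[i][j] / expected)
    return pearson, g2


def inverted_test_ci(x11, x12, x21, statistic="pearson", alpha=0.05, u_max=10000):
    """Smallest and largest integer u in 1..u_max not rejected at level alpha."""
    index = 0 if statistic == "pearson" else 1
    accepted = [
        u for u in range(1, u_max + 1)
        if chi2_1df_pvalue(completed_table_stats(x11, x12, x21, u)[index]) > alpha
    ]
    return accepted[0], accepted[-1]


observed_total = 20 + 180 + 80  # the 280 animals seen in at least one sample
for name in ("pearson", "lr"):
    low, high = inverted_test_ci(20, 180, 80, statistic=name)
    print(name, "CI for m22:", (low, high),
          "CI for N:", (low + observed_total, high + observed_total))
```

For the Pearson statistic this search returns the interval [430, 1198] for $m_{22}$ quoted above; the likelihood ratio version should land close to the interval [437, 1233] reported in the text.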

Another alternative for CIs is to use the bootstrap (Buckland, 1984). To apply the 
bootstrap here, resample from the observed individuals in the second sample. Take 
R samples of size 100 with replacement from the 20 tagged and 80 untagged fish 
we observed. Calculate $\hat{N}^*$ for each of the R resamples, and find the 2.5 and 97.5
percentage points of the R values. With R = 999, the 95% CI is the 25th and 975th
values from the ordered list of the $\hat{N}^*$, [714, 1538].

Note that all three of these CIs resulting from Pearson’s chi-square test, the likelihood
ratio chi-square test, and the bootstrap are similar, but all differ from the CIs
based on the asymptotic normality of $\hat{N}$ or $\tilde{N}$.
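The bootstrap interval described above can be sketched as follows. This Python illustration resamples the second sample and takes percentiles of the resulting estimates; the function name, the seed, and the handling of a resample with no tagged fish are assumptions made for the sketch, and the endpoints will vary slightly with the random seed.

```python
import random


def bootstrap_ci(n1, n2, m, reps=999, seed=12345):
    """Percentile bootstrap CI for N, resampling the n2 fish in the second sample."""
    rng = random.Random(seed)
    second_sample = [1] * m + [0] * (n2 - m)   # 1 = tagged, 0 = untagged
    estimates = []
    for _ in range(reps):
        m_star = sum(rng.choices(second_sample, k=n2))
        # A resample with no tagged fish would give an infinite estimate.
        estimates.append(n1 * n2 / m_star if m_star > 0 else float("inf"))
    estimates.sort()
    # With reps = 999 these are the 25th and 975th ordered values.
    lower = estimates[round(0.025 * (reps + 1)) - 1]
    upper = estimates[round(0.975 * (reps + 1)) - 1]
    return lower, upper


print(bootstrap_ci(200, 100, 20))
```

Because the number of tagged fish in a resample is an integer, the endpoints can take only values of the form $n_1 n_2 / m^*$, which is why different seeds give intervals that move in discrete steps.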


13.1.3 Using Capture-Recapture on Lists 


Capture—recapture estimation is not limited to estimating wildlife populations. It 
can also be used when the two samples are lists of individuals, provided that the 
assumptions for the method are met. Suppose you want to estimate the number of 
statisticians in the United States, and obtain membership lists from the American 
Statistical Association (ASA) and the Institute for Mathematical Statistics (IMS). 
Every statistician either is or is not a member of the ASA, and either is or is not a 
member of the IMS. (Of course, there are other worthy statistical organizations, but 
for simplicity let’s limit the discussion here to these two.) Then $n_1$ is the number of
ASA members, $n_2$ the number of IMS members, and m is the number of persons on
both lists. We can estimate the number of statisticians using $\hat{N} = n_1 n_2 / m$, exactly
as if statisticians were fish. The assumptions for this estimate are as above, but with 
slightly different implications than in wildlife settings: 


1 The population is closed. In wildlife surveys, this assumption may not be met 
because animals often die or migrate between samples. When treating lists as the 
samples, though, we can usually act as though the population is closed if the lists 
are from the same time period. 


2 Each list provides an SRS from the population of statisticians. This assumption 
is more of a problem; it implies that the probability of belonging to ASA is the 
same for all statisticians, and the probability of belonging to IMS is the same for 
all statisticians. It does not allow for the possibility that a group of statisticians 
may refuse to belong to either organization, or for the possibility that subgroups 
of statisticians may have different probabilities of belonging to an organization. 


3 The two lists are independent. Here, this means that the probability that a statisti- 
cian is in ASA does not depend on his or her membership in IMS. This assump- 
tion is also often not met—it may be that statisticians tend to belong to only one 
organization, and therefore that ASA members are less likely to belong to IMS 
than non-ASA members. 


4 Individuals can be matched on the lists. This sounds easy, but often proves
surprisingly difficult. Is J. Smith on List 1 the same person as Jonquil Smith
on List 2? Larsen and Rubin (2001) describe some of the problems that can occur
when you try to link records.


An important application of capture-recapture methods is estimating undercoverage
in a census. In this setting, sample 1 is the census and sample 2 is an independent
probability sample taken from the population.


EXAMPLE 13.2
The U.S. Census Bureau tries to enumerate everyone in the decennial census.
Inevitably, however, persons are missed, leading population estimates from the census 
to underestimate the true population count. Moreover, it is thought that the undercount 
rate is not uniform; the undercount is thought to be greater for inner city areas and 
minority groups, and varies among different regions of the United States. As Con- 
gressional Representatives, billions of dollars of federal funding, and other resources 
are apportioned based on census results, it is important that the population counts 
be accurate. Capture—recapture estimation, called dual system estimation in this con- 
text, has been used since 1950 to evaluate the coverage of the U.S. decennial census. 
Fienberg (1992) gives a bibliography for dual-system estimation. 

Hogan (1993) describes the Post-Enumeration Survey (PES) used in the 1990 
U.S. census. The general principles are the same for other census years: Citro et al. 
(2004, Chapters 5-6) describe procedures that were used to assess coverage of the 
census in 2000. Kostanich et al. (2004, Chapter 5-6) describe plans for the Census 
Coverage Measurement program to be used for the 2010 census. A similar procedure, 
called the Reverse Record Check, is used in Canada. Two samples are taken. The 
P-sample is taken directly from the population, independently of the census, and is 
used to estimate number of persons missed by the census. The E-sample is taken 
from the census enumeration itself, and is used to estimate errors in the census such 
as nonexistent persons or duplicates. 

Separate population estimates are derived for each poststratum, where the pop- 
ulation is poststratified by region, race, ownership of dwelling unit, age, and other 
variables. Poststrata are used because it is hoped that assumption 2 of equal recapture 
probabilities is approximately satisfied within each poststratum; we know it is not 
satisfied for the population as a whole because of the differential undercount rates in 
the census. The population table for a poststratum is as follows: 


                    In Census Enumeration?
                    Yes          No
In PES?     Yes     $N_{11}$     $N_{12}$     $N_{1+}$
            No      $N_{21}$     $N_{22}$     $N_{2+}$
                    $N_{+1}$     $N_{+2}$     $N$

Then,

$$\hat{N} = \frac{\hat{N}_{+1}\,\hat{N}_{1+}}{\hat{N}_{11}}.$$

The estimates $\hat{N}_{1+}$ and $\hat{N}_{11}$ are from the P-sample: $\hat{N}_{1+}$ is the estimate of the
poststratum total, using weights, from the P-sample, and $\hat{N}_{11}$ is a weighted estimate
of matches between the P-sample and the census enumeration. Here, $\hat{N}_{+1}$ is not the
actual count from the census, but is the census count adjusted using the E-sample to
remove duplicates and fictitious persons. Many sample sizes in poststrata were small, 
leading to large variances for the estimates of population count, so the estimates were 
smoothed and adjusted using regression models. 

The assumptions above need to be met for dual-system estimation to give a better 
estimate of the population than the original census data. It is hoped that assumption 
2 holds within the poststrata. Assumption 3 is also of some concern, though, as the 
P-sample also has nonresponse. Persons missed in the census may also be missed 
in the P-sample. Another concern is the ability to match persons in the P-sample to 
persons in the census. Because P-sample persons not matched are assumed to have 
been missed by the census, errors in matching persons in the two samples can lead 
to biases in the population estimates. In the 2000 Census Accuracy and Coverage 
Evaluation, the initial evaluation missed a substantial number of duplicate Census 
enumerations; this was corrected in a revised evaluation (see U.S. Census Bureau, 
2003b). ■


13.2 Multiple Recapture Estimation


The assumptions for the two-sample capture—recapture estimators described in Sec- 
tion 13.1 are strong: The population must be closed and the two random samples 
independent. Moreover, these assumptions cannot be tested, because we observe only 
three of the four cells in the contingency table—we need all four cells to be able to 
test independence of samples. 

More complicated models may be fit if K > 2 random samples are taken, and 
especially if different markings are used for individuals caught in the different sam- 
ples. With fish, for example, the left pectoral fin might be marked for fish caught in 
the first sample, the right pectoral fin marked for fish caught in the second sample, 
and a dorsal fin marked for fish caught in the third sample. A fish caught in Sample 4 
that had markings on the left pectoral fin and dorsal fin, then, would be known to have 
been caught in Sample 1 and Sample 3, but not Sample 2. 

Schnabel (1938) first discussed how to estimate N when K samples are taken. She 
found the maximum likelihood estimator of N to be the solution to 


$$\sum_{i=1}^{K} \frac{(n_i - r_i) M_i}{N - M_i} = \sum_{i=1}^{K} r_i,$$

where $n_i$ is the size of sample i, $r_i$ is the number of recaptured fish in sample i, and
$M_i$ is the number of tagged fish in the lake when sample i is drawn.
If individual markings are used, we can also explore issues of immigration 
or emigration from the population, and test some of the assumptions of 
independence. 
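The equation for the K-sample maximum likelihood estimator has no closed-form solution in general, but its left side decreases in N, so a bracketing search works. The following Python sketch uses bisection; it is an illustration under the assumption that at least one fish is recaptured and that the sample drawn when the most fish are tagged contains some untagged fish. The function name is hypothetical.

```python
def schnabel_mle(n, r, M, tol=1e-9):
    """Solve sum((n_i - r_i) * M_i / (N - M_i)) = sum(r_i) for N by bisection.

    n, r, M are sequences of sample sizes, recapture counts, and numbers of
    tagged fish in the lake when each sample is drawn.
    """
    total_recaptures = sum(r)
    if total_recaptures == 0:
        raise ValueError("no recaptures: the equation has no finite solution")

    def excess(N):
        left = sum((ni - ri) * Mi / (N - Mi) for ni, ri, Mi in zip(n, r, M))
        return left - total_recaptures

    # N must exceed every M_i; just above max(M) the left side is very large.
    low = max(M) + 1e-6
    if excess(low) <= 0:
        raise ValueError("cannot bracket a solution above max(M)")
    high = 2.0 * low
    while excess(high) > 0:
        high *= 2.0
    while high - low > tol * high:
        mid = 0.5 * (low + high)
        if excess(mid) > 0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


# Two-sample check with the fish data: no fish are tagged when sample 1 is drawn,
# 200 are tagged when sample 2 is drawn, and 20 of the 100 fish in sample 2 are recaptures.
print(round(schnabel_mle(n=[200, 100], r=[0, 20], M=[0, 200]), 3))  # 1000.0
```

With K = 2 the equation reduces to $(n_2 - m) n_1 / (N - n_1) = m$, whose solution is $n_1 n_2 / m$, so the check recovers the two-sample estimate.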


EXAMPLE 13.3
Domingo-Salvany et al. (1995) used capture-recapture to estimate the prevalence
of opiate addiction in Barcelona, Spain. One of their data sets consisted of three 
samples from 1989: (1) a list of opiate addicts from emergency rooms (E list), (2) 
a list of persons who started treatment for opiate addiction during 1989, reported to 
the Catalonia Information System on Drug Abuse (T list), and (3) a list of heroin 
overdose deaths registered by the forensic institute in 1989 (D list). A total of 2864 
distinct persons were on the three lists. Persons on the three lists were matched, with 
the following results: 


                        In D list?
                        Yes                     No
                        In T list?              In T list?
                        Yes       No            Yes       No
In E list?    Yes         6       27            314     1728
              No          8       69            712        ?


It is unclear whether these data will fulfill the assumptions for the two-sample
capture-recapture method. The assumption of independence among the samples may not be
met—if treatment is useful, treated persons are less likely to appear in one of the 
other samples. In addition, persons on the death list are much less likely to subse- 
quently appear on one of the other lists; the closed population assumption is also not 
met because one of the samples is a death list. Nevertheless, an analysis using the 
imperfectly met assumptions can provide some information on the number of opiate 
addicts. Because there are more than two samples, we can assess the assumptions 
of independence among different samples by using loglinear models. There is one 
assumption, though, that we can never test: The missing cell follows the same model 
as the rest of the data. ■


If three samples are taken, the expected counts are: 


                        In Sample 3?
                        Yes                         No
                        In Sample 2?                In Sample 2?
                        Yes          No             Yes          No
In Sample 1?    Yes     $m_{111}$    $m_{121}$      $m_{112}$    $m_{122}$
                No      $m_{211}$    $m_{221}$      $m_{212}$    $m_{222}$


Loglinear models were discussed in Section 10.4. The saturated model for three 
samples is 


$$\ln m_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}.$$


This model cannot be fit, however, as it requires eight degrees of freedom (df) and
we only have seven observations. The following models may be fit, with $\alpha$ referring
to the E list, $\beta$ referring to the T list, and $\gamma$ referring to the D list.


1 Complete independence.

$$\ln m_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k.$$

This model implies that presence on any of the lists is independent of presence
on any of the other lists. The independence model must always be adopted in
two-sample capture-recapture.


2 One list is independent of the other two.

$$\ln m_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij}.$$

Presence on the E list is related to the probability that an individual is on the T list,
but presence on the D list is independent of presence on the other lists. There are
three versions of this model: the other two substitute $(\alpha\gamma)_{ik}$ or $(\beta\gamma)_{jk}$ for $(\alpha\beta)_{ij}$.


3 Two samples are independent given the third.

$$\ln m_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}.$$

Three models of this type exist; the other two substitute either $(\alpha\beta)_{ij} + (\beta\gamma)_{jk}$ or
$(\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$ for $(\alpha\beta)_{ij} + (\alpha\gamma)_{ik}$. In the version shown, presence on the death
and treatment lists is conditionally independent given the E list status: once we know
whether a person is on the emergency room list, knowing that he or she is on the death
list gives us no additional information about the probability that he or she will be on
the treatment list.


4 All two-way interactions.

$$\ln m_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}.$$

This model will always fit the data perfectly: it has the same number of parameters
as there are observed cells in the contingency table.


Unfortunately, in none of these models can we test the hypothesis that the missing 
cell follows the model. But at least we can examine hypotheses of pairwise inde- 
pendence among the samples. For the addiction data, the following loglinear mod- 
els were fit from the data, using the SAS PROC CATMOD code on the website 
(any loglinear model program that finds estimates using maximum likelihood will 
work): 


Model                   $G^2$   df   p-value   $\hat{m}_{222}$   $\hat{N}$    95% CI
1   Independence        1.80    3    0.62       3,967      6,831     [6,322, 7,407]
2a  E*T                 1.09    2    0.58       4,634      7,499     [5,992, 9,706]
2b  E*D                 1.79    2    0.41       3,959      6,823     [6,296, 7,425]
2c  T*D                 1.21    2    0.55       3,929      6,793     [6,283, 7,373]
3a  E*T, E*D            0.19    1    0.67       6,141      9,005     [5,921, 16,445]
3b  E*T, T*D            0.92    1    0.34       4,416      7,280     [5,687, 9,820]
3c  E*D, T*D            1.20    1    0.27       3,918      6,782     [6,253, 7,388]
4   E*T, E*D, T*D       -       0    -          7,510     10,374     [4,941, 25,964]

Here, $G^2$ is the likelihood ratio test statistic (deviance) for that model. Somewhat
surprisingly, the model of independence fits the data well. The predicted cell counts
under model 1, complete independence, are:


                        In D list?
                        Yes                     No
                        In T list?              In T list?
                        Yes       No            Yes       No
In E list?    Yes        5.1     28.3          310.8    1730.7
              No        11.7     64.9          712.4    3966.7


These predicted cell counts lead to the estimate

$$\hat{N} = 2864 + 3967 = 6831$$

if the model of independence is adopted. The values of $\hat{N}$ for the other models are
calculated similarly, by estimating the value in the missing cell from the model and
adding that estimate to the known total for the other cells, 2864.
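For the independence model, the missing-cell estimate can be reproduced without specialized software. The Python sketch below is not the SAS PROC CATMOD code referred to above; it is an illustration that uses a simple fill-in iteration: complete the table with the current value of the missing cell, refit independence from the three list totals, and repeat until the value stops changing. At convergence the fitted counts match the observed list totals and the observed grand total, which are the likelihood equations for this model. All names are hypothetical.

```python
import math

# Observed counts from Example 13.3, keyed by (E, T, D): 1 = on the list, 0 = not on it.
observed = {
    (1, 1, 1): 6, (1, 0, 1): 27, (1, 1, 0): 314, (1, 0, 0): 1728,
    (0, 1, 1): 8, (0, 0, 1): 69, (0, 1, 0): 712,
}


def fit_independence(observed, tol=1e-10, max_iter=100000):
    """Return (fitted counts for all 8 cells, estimate of the missing cell)."""
    t = sum(observed.values())
    # Number of observed persons on each list; these margins are fully observed.
    on_list = [sum(x for cell, x in observed.items() if cell[k] == 1)
               for k in range(3)]
    u = 0.0
    for _ in range(max_iter):
        total = t + u
        absent = [1.0 - count / total for count in on_list]
        new_u = total * absent[0] * absent[1] * absent[2]
        converged = abs(new_u - u) < tol * max(1.0, new_u)
        u = new_u
        if converged:
            break
    total = t + u
    fitted = {}
    for e in (0, 1):
        for tr in (0, 1):
            for d in (0, 1):
                probs = [on_list[k] / total if flag else 1.0 - on_list[k] / total
                         for k, flag in enumerate((e, tr, d))]
                fitted[(e, tr, d)] = total * probs[0] * probs[1] * probs[2]
    return fitted, u


def deviance(observed, fitted):
    """Likelihood ratio statistic G^2 computed over the observed cells."""
    return 2.0 * sum(x * math.log(x / fitted[cell])
                     for cell, x in observed.items() if x > 0)


fitted, missing = fit_independence(observed)
print(round(missing, 1), round(sum(observed.values()) + missing),
      round(deviance(observed, fitted), 2))
```

The iteration converges slowly (several hundred passes for these data) but each pass is trivial. Models with interaction terms need a general maximum likelihood routine for loglinear models rather than this shortcut.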

We can use an inverted likelihood ratio test (Cormack, 1992) to construct a CI for 
N using any of the models. A 95% CI for the missing cell consists of those values
u for which a 0.05-level hypothesis test of $H_0: m_{222} = u$ would not be rejected for
the loglinear model adopted. Let $G^2(u)$ be the likelihood-ratio test statistic (deviance)
for the completed table with u substituted for the missing cell, let t be the total of the
seven observed cells, and let $\hat{u}$ be the estimate of the missing cell using that loglinear
model. Cormack shows that the set


where $q_1(\alpha)$ is the percentile of the $\chi^2_1$ distribution with right-tail area $\alpha$, is an
approximate $100(1 - \alpha)$% CI for $m_{222}$. We give a computer program for calculating Cormack’s
CI on the website. This CI is conditional on the model selected and does not include 
uncertainty associated with the choice of model. Cormack also discusses extending 
the inverted Pearson chi-square test for goodness of fit, which produces a similar 
interval. Buckland and Garthwaite (1991) discuss using the bootstrap to find CIs 
for multiple recapture using loglinear models; they incorporate the model-selection 
procedure into each bootstrap iteration. 

For these data, the point estimate and CI appear to rely heavily on the particular 
model fit, even though all seem to fit the observed cells. Note that the estimate $\hat{N}$
is larger and the CIs much wider for models including the E*T interaction, even 
though that interaction is not statistically significant. The good fit of the independence 
model is somewhat surprising because you would not expect the assumptions for 
independence to be satisfied. In addition, the population is not closed, but we have 
little information on migration in and out of the population. 


13.3 Chapter Summary


Multiple samples from a population may be used to estimate its size. In the simplest
form, two independent SRSs are taken and the number of population units found
in both SRSs is used to estimate the population size. If the two samples are not
independent (in particular, if individuals in the first sample are more likely to also
appear in the second sample), then $\hat{N}$ calculated assuming independence is likely to
underestimate the population size N.

Some forms of dependence can be assessed if three or more samples are taken. In
that case, loglinear models can be fit to the data and used to predict the value of the
missing cell.


Key Terms


Capture-recapture estimation: A method for estimating population size in which
two independent samples are taken and the overlap used to estimate N.


Dual-system estimation: A form of capture-recapture estimation used to estimate
undercount in a population census.


For Further Reading


In this chapter, we have just presented an introduction to estimating population size,
under the assumption that the population is closed. Much other research has been done
in capture-recapture estimation, including models for populations with births, deaths,
and migrations; good sources for further reading are Seber (1982), Pollock (1991), and
the review papers by the International Working Group for Disease Monitoring and
Forecasting (1995a, 1995b). Chao et al. (2001) summarize recent research in
capture-recapture methods used to estimate disease prevalence and provide links to S-Plus
programs.


13.4 Exercises


A. Introductory Exercises


1 Suppose that an SRS of 500 fish is caught from a lake; each is marked and released,
and the fish are allowed to remix with the other fish in the lake. A second sample of 
300 fish has 120 marked fish. Estimate the total number of fish in the lake, along with 
a 95% CI. 


B. Working with Survey Data 


2 Investigators in the Wisconsin Department of Natural Resources (1993) used
capture-recapture to estimate the number of fishers in the Monico Study Area in
Wisconsin.


a In the first study, 7 fishers were captured between August 11, 1981 and January
31, 1982. Twelve fishers were captured between February 1 and February 19,
1982; of those 12, 4 had also been captured in the first sample. Give an estimate
of the total number of fishers in the area, along with a 95% CI.


b In the second study, 16 fishers were captured between September 28, 1982 and
October 31, 1982, and 19 fishers were captured between November 1 and
November 17, 1982. Eleven of the 19 fishers in the second sample had also been caught
in the first sample. Give an estimate of the total number of fishers in the area,
along with a 95% CI.


c What assumptions are you making to calculate these estimates? What do these
assumptions mean in terms of fisher behavior and “catchability”? 


3 Alexander et al. (1997) apply capture-recapture methods to estimate the number of
Mead’s milkweed plants in a tract in Kansas. In some years, Mead’s milkweed plants
do not produce aboveground parts; in addition, if they are in dense vegetation and are
not flowering, they are difficult to observe. Thus, a census of observed plants in any
given year is likely an undercount. From the first two years of observation, 15 plants
were observed in year 1 but not in year 2, 12 plants were observed in year 2 but not
in year 1, and 33 plants were observed in both years. Estimate the total number of 
plants in the tract, along with a 95% CI. 


4 Bellemain et al. (2005) relied on moose hunters in Norway to collect fecal
ples from brown bears. Each sample was genotyped, and the number of distinct 
individuals was found in each of 2001 and 2002. In 2001, 311 unique genotypes were 
obtained (134 males and 177 females). In 2002, the procedure was repeated and 239 
unique genotypes were obtained (106 males and 133 females). 165 of the individuals 
sampled in 2001 were also sampled in 2002. 


a Fifty-six bears in the area in 2001 had also been followed with radio transmitters;
36 of these bears were represented in the 311 genotypes from the 2001 feces 
samples. Estimate the number of bears in 2001, along with a 95% CI. 


b In 2002, 57 bears had radio transmitters, and 28 of them were among the 239 
genotypes from the 2002 feces samples. Estimate the number of bears in 2002, 
along with a 95% CI. 


c Estimate the number of bears, along with a 95% CI, treating the samples from 2001 
and 2002 as independent SRSs (and ignoring the radio transmitter data). What 
assumptions are needed to use capture—recapture estimators of population size? 


5 Domingo-Salvany et al. (1995) also used capture-recapture on the emergency room
survey by dividing the list into four samples according to trimester (TR). The following
data are from Table 1 of their paper:


                      TR1 yes   TR1 yes   TR1 no    TR1 no
                      TR2 yes   TR2 no    TR2 yes   TR2 no
TR3 yes, TR4 yes         29        35        35        96
TR3 yes, TR4 no          48        58        80       400
TR3 no, TR4 yes          25        77        50       376
TR3 no, TR4 no           97       357       312         ?


Fit loglinear models to these data. Which model do you think is best? Use your model
to estimate the number of persons in the missing cell, and construct a 95% CI.


6 Chao et al. (2001) report data on an outbreak of Hepatitis A virus among students of a
college. Investigators want to estimate N, the total number of students with Hepatitis 
A. Cases were reported from three sources: (1) a serum test conducted by the Institute 
of Preventive Medicine (P list), (2) local hospital records from the National Quarantine 
Service (Q list), and (3) records collected by epidemiologists (E list). The following 
table gives the counts from the three sources: 


P list? Q list? E list? Count 


no no yes 63 
no yes no 55 
no yes yes 18 
yes no no 69 
yes no yes 17 
yes yes no 21 
yes yes yes 28 


a Suppose that only the P list and Q list had been collected, with $n_1 = 135$, $n_2 = 122$,
and $m = 49$. Calculate $\hat{N}$, Chapman’s estimate $\tilde{N}$, and the standard error for each
estimate.


b Fit loglinear models to the data. Using the deviance, evaluate the fit of these
models. Is there evidence that the lists are dependent? 


7 Cochi et al. (1989) recorded data on congenital rubella syndrome from two sources.
The National Congenital Rubella Syndrome Registry (NCRSR) obtained data through 
voluntary reports from state and local health departments. The Birth Defects Moni- 
toring Program (BDMP) obtained data from hospital discharge records from a subset 
of hospitals. Below are data from 1970 to 1985, from the two systems: 


Year NCRSR BDMP Both 


1970 45 15 2 
1971 23 5 0
1972 20 6 2 
1973 22 13 3 
1974 12 6 1 
1975 22 9 1 
1976 15 7 2 
1977 13 8 3 
1978 18 9 2 
1979 39 11 2 
1980 12 4 1 
1981 4 0 0 
1982 11 2 0 
1983 3 0 0 
1984 3 0 0 
1985 1 0 0


a The authors state that the NCRSR and the BDMP are independent sources of
information. Do you think that is plausible? What about the other assumptions
for capture-recapture?


b Use Chapman’s estimate (13.2) to find $\tilde{N}$ for each year for which you can
calculate the estimate. What estimate will you use for the years in which Chapman’s
estimate cannot be calculated?


c Now aggregate the data for all the years, and estimate the total number of cases
of congenital rubella syndrome between 1970 and 1985. How does your estimate
from the aggregated data compare with the sum of the estimates from (b)? Which
do you think is more reliable?


d Is there evidence of a decline in congenital rubella syndrome? Provide a statistical
analysis to justify your answer.


8 Frank (1978) reports on the following experiment to estimate the number of minnows
in a tank. The first two samples used a minnow trap to catch fish, while the third 
used a net to catch the fish. Minnows trapped in the first sample were marked by 
clipping their caudal fin, and minnows trapped in the second sample were marked 
by clipping the left pectoral fin. 


Sample 1? Sample 2? Sample 3? Number of Fish 


yes yes yes 17 
yes no yes 28 
no yes yes 52 
no no yes 234 
yes yes no 80 
yes no no 223 
no yes no 400 


Which loglinear model provides the best fit to these data? Using that model, estimate 
the total number of fish, and provide a 95% CI. 


9 In the experiment in Exercise 8, what does it mean in terms of fish behavior if there
is an interaction between presence in sample 1 and presence in sample 2? Between
presence in sample 1 and presence in sample 3?


10 Egeland et al. (1995) use capture-recapture to estimate the total number of fetal
alcohol syndrome cases among Alaska natives born between 1982 and 1989. Two 
sources of cases were used: thirteen cases identified by private physicians, and 
45 cases identified by the Indian Health Service (IHS). Eight cases were on both 
lists. 


a Estimate the total number of fetal alcohol syndrome cases. Give a 95% CI for
your estimate, using either the inverted chi-square test or the bootstrap method. 


b The capture-recapture estimate relies on the assumption that the two sources of 
data are independent—that is, a child on the IHS list has the same probability of 
appearing on the private physicians list as a child not on the IHS list. Do you think 
this assumption will hold here? Why or why not? What advice would you give 
the investigators if they were concerned about independence?


c Suppose that children who are seen by private physicians are less likely to be seen
by the IHS. Is $\hat{N}$ then likely to underestimate or to overestimate the number of
children with fetal alcohol syndrome? Explain.


C. Working with Theory 


11 Note that in (13.1), $\hat{N} = n_1 / \hat{p}$, where $\hat{p}$ is the sample proportion of individuals in the
second sample that are tagged. Use the linearization method of Chapter 9 to find an
estimator of $V(\hat{N})$.


12 The distribution of $\hat{N}$ in (13.1) is often not approximately normal. The distribution of
$\hat{p} = m/n_2$, however, is often close to normality, and CIs for p are easily constructed.
For the data in Example 13.1, find a 95% CI for p. How can you use that interval to
obtain a CI for N? How does the resulting CI compare with others we calculated? Is
the interval symmetric about $\hat{N}$?


13 (Requires mathematical statistics.) In a lake with N fish, $n_1$ of them tagged, the
probability of obtaining m recaptured and $n_2 - m$ previously uncaught fish in a simple
random sample of size $n_2$ is

$$L(N \mid n_1, n_2) = \frac{\binom{n_1}{m}\binom{N - n_1}{n_2 - m}}{\binom{N}{n_2}}.$$

The maximum likelihood estimator $\hat{N}$ of N is the value which maximizes L(N); it is
the value that makes the observed value of m appear most probable if we know $n_1$ and
$n_2$. Find the maximum likelihood estimator of N. HINT: When is L(N) > L(N - 1)?


14 (Requires mathematical statistics.) Maximum likelihood estimation of N in large
samples. Suppose that $n_1$ of the N fish in a lake are marked. An SRS of $n_2$ fish is then taken,
and m of those fish are found to be marked. Assume that N, $n_1$, and $n_2$ are all “large.”
Then the probability that m of the fish in the sample are marked is approximately:

$$L(N) = \binom{n_2}{m}\left(\frac{n_1}{N}\right)^{m}\left(1 - \frac{n_1}{N}\right)^{n_2 - m}.$$

a Show that $\hat{N} = n_1 n_2 / m$ is the maximum likelihood estimator of N.
b Using maximum likelihood theory, show that the asymptotic variance of $\hat{N}$ is
approximately $N^2 (N - n_1)/(n_1 n_2)$.


15 (Requires calculus.) For the situation in Exercise 14, suppose the cost of catching a
fish is the same for each fish in the first and second samples, and you have enough
resources to catch a total of $n_1 + n_2 = C$ fish altogether. If N and C are known and
C < N, what should $n_1$ and $n_2$ be to minimize the variance in Exercise 14(b)?


16 (Requires probability.)
a For Chapman’s estimator $\tilde{N}$ in (13.2), let X be the random variable denoting
the number of marked individuals in the second sample. What is the probability
distribution of X?


b Show that $E[\tilde{N}] = N$ if $n_2 > N - n_1$.


17 Suppose the lake has N fish, and $n_1$ of them are marked. A sample of size $n_2$ is
then drawn from the lake. Choose three values of N, $n_1$, and $n_2$. Approximate the
distribution of $\hat{N}$ by drawing 1000 different samples of size $n_2$ from the population of
N units and drawing a histogram of the $\hat{N}$ that result from the different samples. Repeat
this for other values of N, $n_1$, and $n_2$. When does the histogram appear approximately
normally distributed?


D. Projects and Activities 


18 Try out the two-sample capture-recapture method to estimate the total number of
popcorn kernels or dried beans in a package, or to estimate the total number of coins 
in a jar. Describe fully what you did, and give the estimate of the population size 
along with a 95% CI for N. How did you select the sizes of the two samples? 


19 Repeat the preceding exercise, using three samples and loglinear models. Would you
expect the model of complete independence to fit well? Does it? 


Rare Populations and Small Area Estimation


Housework can’t kill you, but why take a chance?!
- Phyllis Diller


EXAMPLE 14.1
The bestselling book The Millionaire Mind (Stanley, 2000) uses data from a survey of 
millionaire households in the United States. The population of millionaires is difficult 
to sample because no list of all millionaires exists. A simple random sample (SRS) 
from the U.S. population is likely to be inefficient because only a small fraction of 
American households have net worth over a million dollars; most returned surveys 
in an SRS will contain few members of the population of interest. Stanley had the 
sampling frame prepared by estimating the proportion of millionaire households in 
each census block group of the United States. He then stratified the block groups 
by estimated proportion of millionaires, and took an SRS of block groups within 
each stratum having at least 30% estimated millionaire households. A total of 5063 
households were selected in those neighborhoods to receive the questionnaire; 1001 
questionnaires were returned, and 733 came from households reporting a net worth 
of at least one million dollars. The sample selected did not cover the entire population 
of millionaires in the United States, since households in block groups with fewer than 
30% estimated millionaire households were not included in the sampling frame. ■


In this chapter, we discuss two situations for designing surveys. The first relates 
to Example 14.1: How to design a survey to sample units that belong to a rare 
population. A population can be rare in several ways. The number of individuals 
belonging to the rare population may be very small; snow leopards are a rare popula- 
tion simply because there are not very many of them. Or there may be a large number 
of individuals, but they form only a small fraction of the population. Millionaires, 
for example, are reasonably plentiful (one estimate has about 3.5 million millionaires in the
United States) but comprise a small percentage of the U.S. population (U.S. Census 
Bureau, 2007, Table 697). An SRS of persons in the United States, therefore, will 
yield few millionaires. Moreover, millionaires tend to be highly clustered, so that
many geographic primary sampling units (psus) may have few, if any, millionaires. If
we had a list of all millionaires in the United States, it would be quite easy to select a
probability sample of them. For many rare populations, however, no such list exists;
indeed, for some rare populations such as persons with Alzheimer’s disease, it may
be very difficult to determine membership in the population because persons may be
unaware they have the disease. The challenge is to obtain a sufficiently large
probability sample of the rare population for the desired accuracy while controlling costs.

¹ Although this is a wonderful quote, unfortunately housework can kill you; the Vital Statistics at
www.cdc.gov tabulate deaths by accident in the house.

In many surveys, we want to estimate quantities for many subpopulations, for 
example, to estimate the unemployment rate for every county in the United States. If 
we only wanted to estimate the unemployment rate in one county, we could design 
a survey with large sample size in that county. But a survey with sufficiently large 
sample size in every county will have unacceptably large cost. Instead, we would like 
to estimate county unemployment rates using an existing national survey on unem- 
ployment. Such a survey will likely have small sample sizes (or perhaps even no 
observations) in some counties. A sample of 60,000 households may give accurate 
estimates of the national unemployment rate, but the sample might have only a few 
observations in Larimer County—so few that an estimate of the unemployment rate 
in Larimer County that uses only the observations in the sample will have an unac- 
ceptably large margin of error. Larimer County, in this example, is a small area (also 
called a small domain); the population or land area of Larimer County may be large, 
but the sample size in the county is small. Section 14.2 explores models that may be 
used to improve accuracy of estimates for small areas. 


14.1 


Sampling Rare Populations 


Sometimes you would like to investigate characteristics of a population that is difficult 
to find, or that is dispersed widely in the target population. For example, relatively 
few people are victims of violent crime in a given year, but you may want to obtain 
information about the population of violent crime victims. In an epidemiology survey, 
you may want to estimate the incidence of a rare disease, and to make sure you have 
enough persons with the disease in your sample to analyze how the persons with the 
disease differ from persons without the disease. 

One possibility, of course, is to take a very large sample. That is done in the 
National Crime Victimization Survey (NCVS), which is used to estimate victimiza- 
tion rates. Because the NCVS was intended to estimate victimization rates for many 
different types of victimizations and to investigate households’ victimization expe- 
riences over time, it was designed to be approximately self-weighting. The sample 
size for domestic violence victims, however, is very small. The NCVS would need to 
be prohibitively expensive to remain a self-weighting survey and still give sufficient 
sample sizes for all types of crime victims. In this section, we describe survey designs 
that have been proposed for estimating the prevalence of a rare characteristic or esti- 
mating quantities of interest for a rare population; several are based on concepts we 
have already discussed in this book. 

Nonresponse can be an especial hazard in surveys of rare populations. If pop- 
ulation members with the rare characteristic are more likely to be nonrespondents 
than members without the rare characteristic, estimates of prevalence will be biased.
In some health surveys, the characteristic itself can lead to nonresponse—a survey 
of cancer patients may have nonresponse because the illness prevents persons from 
responding. It is therefore important to try to minimize nonresponse for any survey 
of a rare population. 


14.1.1 Stratified Sampling with Disproportional Allocation


Sometimes strata can be constructed so that the rare characteristic is much more
prevalent in one of the strata (say, in stratum 1). Then a stratified sample in which
the sampling fraction is higher in stratum 1 can give a more accurate estimate of the
prevalence of the rare characteristic in the general population. The higher sampling
fraction in stratum 1 also increases the domain sample size for population members
with the rare characteristic. The National Maternal and Infant Health Survey (MIHS),
discussed in Example 11.1, sampled a higher fraction of records from low-birth-weight
infants to ensure an adequate sample size of such infants.

Disproportional stratified sampling may work well when the allocation is efficient
for all items of interest. For example, in the MIHS, a major concern was low-birth-weight
infants, who have many more health problems. But disproportional stratification
may not be helpful for all items of interest in other surveys. Having higher
sampling fractions in census block groups thought to have high proportions of
millionaires is sensible for estimating characteristics of millionaires. The design is not as
efficient for estimating characteristics of persons who work at home, a rare population
that is not necessarily concentrated in those block groups.


EXAMPLE 14.2


Edwards et al. (2005) use models to construct strata for sampling rare lichen species 
in Washington, Oregon, and northern California. The rare species were uncommon 
in the pilot sample, so they fit classification tree models to predict presence of four 
common lichen species that frequently occur with the rare lichen species of interest. 
Each of the common lichen species was detected on at least 120 of the 840 sites 
sampled in a lichen air quality study, giving sufficient information to build models 
predicting presence of each species of common lichen from variables such as slope, 
aspect, precipitation, temperature, and relative humidity. Using data collected in a 
second sample, Edwards et al. (2005) estimate that a disproportional stratified sample 
based on the classification tree models would result in a 1.2- to 5-fold gain in sampling 
efficiency for four of the rare lichen species. ■
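
To make the effect of a higher sampling fraction in the high-prevalence stratum concrete, the following Python sketch compares a proportional and a disproportional allocation of the same total sample size. The stratum sizes, prevalences, and allocations are invented for illustration, and the variance formula is the usual one for a stratified estimator of a proportion with an SRS in each stratum.

```python
def expected_rare_count(alloc, prevalence):
    """Expected number of sampled units with the rare characteristic."""
    return sum(n_h * p_h for n_h, p_h in zip(alloc, prevalence))

def prevalence_variance(sizes, alloc, prevalence):
    """Variance of the stratified estimator of overall prevalence (SRS within strata)."""
    N = sum(sizes)
    var = 0.0
    for N_h, n_h, p_h in zip(sizes, alloc, prevalence):
        S2_h = N_h / (N_h - 1) * p_h * (1 - p_h)   # population variance of the 0/1 indicator
        var += (N_h / N) ** 2 * (1 - n_h / N_h) * S2_h / n_h
    return var

sizes = [10_000, 90_000]      # hypothetical stratum sizes
prevalence = [0.30, 0.01]     # hypothetical prevalence of the rare characteristic
designs = {"proportional": [100, 900], "disproportional": [600, 400]}

for name, alloc in designs.items():
    print(name,
          expected_rare_count(alloc, prevalence),
          prevalence_variance(sizes, alloc, prevalence))
```

In this made-up population the disproportional design yields about 184 expected members of the rare population instead of 39, and a smaller variance for the estimated prevalence. The same allocation could be inefficient for a characteristic unrelated to the stratification.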


14.1.2 Two-Phase Sampling


Two-phase sampling methods were discussed in Chapter 12 as a way of using stratifi- 
cation when the information needed to form the strata is not available before sampling. 
To sample a rare population, we would like to stratify on the variable that indicates 
whether individuals belong to the population or not. Screen the phase I sample units 
to determine whether they have the rare characteristic or not. Then subsample all 
(or a high sampling fraction) of the units with the rare characteristic for the phase II 
sample. If the screening technique is completely accurate, use the phase I sample to
estimate prevalence of the rare characteristic and the phase II sample to estimate other
quantities for the rare population. 

What if the screening technique is not completely accurate? If sampling arctic 
regions for presence of walruses, it is possible that you will not see walruses in some 
of the sectors from the air because the walruses are under the ice. Asking persons 
whether they have diabetes will not always produce an accurate response because 
persons do not always know whether they have it. As Deming (1977) points out, 
placing a person with diabetes in the “no-diabetes” stratum is more serious than 
placing a person without diabetes in the “diabetes” stratum: If only the “diabetes” 
stratum is subsampled, it is likely that the persons without diabetes who have been 
erroneously placed in that stratum will be discovered, while the error for a diabetic 
misclassified into the “no-diabetes” stratum will not be found. One possible solution 
is to broaden the screening criterion so that it encompasses all units that might have 
the rare characteristic. Another solution is to subsample both strata in phase II, but to 
use a much higher sampling fraction in the “likely to have diabetes” stratum. 

You may want to use a different two-phase design for estimating characteristics 
of rare population members than for estimating prevalence of the rare population. 
Exercise 20 of Chapter 12 presented optimal sampling strategies for using a two- 
phase sample to estimate prevalence of a disease. 


14.1.3 Unequal-Probability Sampling


To oversample individuals with the rare characteristic, we can create a model for the 
inclusion probabilities based on related characteristics. This is similar to dispropor- 
tional stratified sampling, except that the unequal probabilities may be used directly 
as well as in stratification. Hoeting et al. (2000) developed a model for predicting the 
presence or absence of a species from satellite data. The model gives a predicted prob- 
ability that the species is present for each pixel in the satellite image. The predicted 
probabilities may then be used to form strata or to specify inclusion probabilities $\pi_i$.

The Mitofsky–Waksberg method for random digit dialing, discussed in Example 6.12,
can be used to sample rare populations that are clustered. In a survey of
millionaires, census block groups can be treated as clusters. A probability sample of 
block groups is drawn (the probability sampling design should rely on stratification 
as well). Select one household from each cluster; if it is a millionaire household, then 
sample additional households in that cluster. This procedure samples clusters with 
probability proportional to the number of millionaire households. 
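
The screening step just described can be sketched in Python as follows. This is only an illustration: it assumes an SRS of equal-sized clusters at the first stage (the text recommends stratification as well), and the function name, cluster contents, and number of extra households are hypothetical.

```python
import random

def two_stage_rare_sample(clusters, n_try, k_extra, rng):
    """clusters: list of lists of 0/1 indicators (1 = household in the rare population).
    Returns (cluster index, sampled household indices) for each retained cluster."""
    retained = []
    for c in rng.sample(range(len(clusters)), n_try):   # SRS of clusters (an assumption)
        households = clusters[c]
        first = rng.randrange(len(households))          # screen one household at random
        if households[first] == 1:                      # hit: keep the cluster
            others = [j for j in range(len(households)) if j != first]
            retained.append((c, [first] + rng.sample(others, k_extra)))
    return retained

# Three hypothetical block groups of 10 households with 0, 2, and 8 millionaire households.
clusters = [[0] * 10, [1] * 2 + [0] * 8, [1] * 8 + [0] * 2]
rng = random.Random(7)
counts = [0, 0, 0]
for _ in range(2000):
    for c, _households in two_stage_rare_sample(clusters, n_try=3, k_extra=4, rng=rng):
        counts[c] += 1
print(counts)   # retention counts by cluster over 2000 repetitions
```

With equal-sized clusters, the retention frequencies should be roughly proportional to the numbers of millionaire households (0, 2, and 8 here), and a cluster with none is never retained.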


14.1.4 Multiple Frame Surveys


Even though you may not have a list of all of the members of the rare popula- 
tion, you may have some incomplete sampling frames that contain a high percentage 
of units with the rare characteristic. You can sometimes combine these incomplete 
frames, omitting duplicates, to construct a complete sampling frame for the pop- 
ulation. Alternatively, you can select samples independently from the frames, then 
combine sample estimates from the incomplete frames (and, possibly, a complete 
frame) to obtain general population estimates. This multiple frame survey approach 
was pioneered by Hartley (1962). 


FIGURE 14.1
Examples of dual frame surveys. In (a), frame A is complete and frame B is incomplete.
In (b), both frames are incomplete.

[Figure panels (a) and (b) not reproduced.]


Suppose you would like to estimate characteristics of persons with Alzheimer’s
disease in the noninstitutionalized population. Since many users of adult day care
centers have Alzheimer’s, you would expect that a sample of adult day care centers
would yield a higher percentage of persons with Alzheimer’s than a general population
survey. But not all persons with Alzheimer’s attend an adult day care center. Thus,
you might have two sampling frames: frame A, which is the sampling frame for
the general population survey, and frame B, which is the sampling frame for adult
day care centers. All persons in frame B are presumed to also be in the frame for the
general population survey, so the design in Figure 14.1(a) has two domains: ab, which
consists of persons in frame A and also in frame B, and a, which consists of persons
in frame A but not in frame B. In other situations, both frames are incomplete, leading
to three domains as in Figure 14.1(b): domain a, consisting of persons in frame A
but not frame B; domain b, with persons in frame B but not frame A; and domain ab,
consisting of persons in both frames.

To estimate population quantities from the general dual frame survey depicted in
Figure 14.1(b), determine the domain membership of each sampled person. Estimate
the population total $t = \sum_{i=1}^{N} y_i$ by $\hat{t}_a + \hat{t}_{ab} + \hat{t}_b$, where $\hat{t}_a$, $\hat{t}_{ab}$, and $\hat{t}_b$ estimate the
population totals in domains a, ab, and b, respectively. A variety of estimators can
be used to estimate the domain totals; some of these are summarized in Lohr
and Rao (2000) and Lohr (2009). Exercise 3 gives Hartley’s (1962) estimator for the
survey depicted in Figure 14.1(b).


EXAMPLE 14.3


The National Survey of Veterans in 2001 (Choudhry et al., 2002) used a dual frame 
survey to sample the target population of veterans living in private households. Frame 
A was a random digit dialing (RDD) frame that covered the population of telephone 
households. Frame A included all veterans living in telephone households, but many 
households contacted through the RDD survey contained no veterans. Frame B was a 
list of veterans constructed from the Veterans Administration Healthcare Enrollment, 
and Compensation and Pension files. Everyone in frame B was eligible for the survey 
so frame B was less expensive to sample than frame A, but frame B did not include 
everyone in the target population. The dual frame survey thus accorded with Fig- 
ure 14.1(a). It combined complete coverage of the telephone household population 
from the frame A survey and lower cost of sampling from the frame B survey. ■


EXAMPLE 14.4


Iachan and Dennis (1993) described the use of multiple frames to sample the homeless
population in Washington, D.C. Four frames were used: (1) homeless shelters, (2) soup 
kitchens, (3) encampments such as vacant buildings and locations under bridges, and 
(4) streets, sampled by census blocks. Although the union of the frames should include 
more of the homeless population than a single frame, it will not include all homeless 
persons. 


[Figure: overlapping sampling frames. Frame A: shelters; frame B: soup kitchens;
frame C: encampments and streets, taken together.]


Membership in more than one frame was estimated by asking survey respondents 
whether they had been or expected to be in soup kitchens, in shelters, or on the street 
in the 24-hour period of sampling. ■


14.1.5 Network or Multiplicity Sampling


In a household survey such as the NCVS, each household provides information only 
on victimizations that have occurred to members of that household. In a network 
sample to study crime victimization (see Czaja and Blair, 1986; Sudman et al., 1988 
for the general method), each household in the population is linked to other units in 
the population; the sampled household can also provide information on units linked 
to it (called the network for that household). For example, the network of a household 
might be defined to be the adult siblings of adult household members. 

Suppose a probability sample of households is taken. Define $G_i$ to be the network
for unit $i$ in the probability sample. Suppose household 1 has adults John and Mary.
Then, if networks are formed using the sibling rule, $G_1$ consists of John, Mary, John’s
adult siblings (Suzy and Fred), and Mary’s adult sibling (Mark). John is asked about 
crime incidents that occurred to him, Suzy, and Fred; Mary is asked about incidents 
that occurred to her and Mark. John’s (or Suzy’s or Fred’s) response can be included 
up to three times in the sample: if John’s household is selected, Suzy’s household is 
selected, or Fred’s household is selected. Mark’s or Mary’s information has only two 
chances of inclusion, if Mark’s or Mary’s household is chosen in the sample. An only 
child is included only if his or her household is selected in the probability sample. 

The multiplicity of individual $k$ is the number of links leading to that individual.
Let $\omega_k = 1/(\text{multiplicity of person } k)$ be the multiplicity weight for person $k$ in the
population of interest. In our example, John, Suzy, and Fred each have multiplicity
weight 1/3, and Mary and Mark each have multiplicity weight 1/2. Let $y_k$ be an
indicator variable for whether person $k$ was a victim of crime. Estimate the total
number of crime victims by

$$\hat{t}_{y,\text{net}} = \sum_{i \in S} w_i \sum_{k \in G_i} \omega_k y_k. \tag{14.1}$$


This estimator and its variance are derived in Exercise 4. 
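
The following Python sketch evaluates (14.1) for the John and Mary example. The sampling weight of 100, the victimization indicators, and the assumption that Suzy, Fred, and Mark each live in their own household are hypothetical; the multiplicities follow the sibling rule described above.

```python
def network_total(sample):
    """Evaluate (14.1). Each sampled household has a sampling weight and a reported
    network; each network entry is (person, multiplicity, y)."""
    total = 0.0
    for household in sample:
        inner = sum(y / multiplicity for _person, multiplicity, y in household["network"])
        total += household["weight"] * inner
    return total

# John, Suzy, and Fred are siblings (multiplicity 3); Mary and Mark are siblings (multiplicity 2).
john_network = [("John", 3, 0), ("Suzy", 3, 1), ("Fred", 3, 0)]
mary_network = [("Mary", 2, 0), ("Mark", 2, 1)]

# Household 1 (John and Mary) is sampled with a hypothetical weight of 100.
sample = [{"weight": 100, "network": john_network + mary_network}]
estimate = network_total(sample)

# Sanity check: a census of the four households (all weights 1) recovers the true count of 2.
census = [
    {"weight": 1, "network": john_network + mary_network},  # John and Mary's household
    {"weight": 1, "network": john_network},                 # Suzy's household
    {"weight": 1, "network": john_network},                 # Fred's household
    {"weight": 1, "network": mary_network},                 # Mark's household
]
census_total = network_total(census)
print(estimate, census_total)
```

Dividing each response by its multiplicity is what keeps a person who can be reached through several households from being overcounted.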

Network sampling can reduce the sampling variability of the estimated prevalence 
of a rare characteristic because it can provide more information per sampled individ- 
ual. Czaja et al. (1986) found that network sampling provided greater precision for 
estimating prevalence of cancer cases. There are, however, additional possibilities for 
error in network sampling. If John is selected in the initial sample, he must report: (1)
his value of the response $y_i$, (2) the response for each person in John’s network ($y_k$
for persons $k$ linked to John), and (3) the number of population units linked to each
person k in John’s network (the multiplicity for person k in John’s network). 

John will probably give the correct multiplicity for his siblings. But with other 
linking rules, John’s report of multiplicity for units in his network may be inaccurate— 
if John’s network consists of students who are in class with him, John may not know 
the number of other classes taken by his classmates. Also, John might not report the 
correct value of the response for persons in his network. John might not be aware 
of criminal victimizations experienced by his siblings and give an inaccurate
count. Social desirability of responses is also an issue. John may know which of his 
siblings have cancer, but may not know that one of them is a substance abuser. 


14.1.6 Snowball Sampling


Snowball sampling is based on the premise that members of the rare population know 
one another. To take a snowball sample of homeless persons, you would locate a 
few homeless persons. Ask each of those persons to identify other homeless persons 
for your sample, ask the new persons in your sample to identify additional homeless 
persons, and so on, until a desired sample size is attained. Snowball sampling can 
create a fairly large sample of a rare population, but in general does not produce 
a probability sample; strong modeling assumptions are needed to generalize results 
from a snowball sample to the population. Although snowball sampling can identify 
members of a rare population who would be difficult to find with other designs, 
the resulting sample is often far from an SRS. Persons with many connections in 
the population of interest are more likely to be included in the sample than persons 
with few connections. Isolated persons may not be reachable at all. Respondent- 
driven sampling methods (Heckathorn, 1997; Salganik and Heckathorn, 2004) use 
information about network connections in the sample to weight the sample units. 


14.1.7 Sequential Sampling


In sequential sampling, observations or psus are sampled one or a few at a time, and 
information from previously drawn psus can be used to modify the sampling design 
for subsequently selected psus. In one method dating back to Stein (1945) and Cox 
(1952), an initial sample is taken, and results from that sample are used to estimate 
the additional sample size necessary to achieve a desired precision. If it is desired that
the sample contain a certain number of members from the rare population, the initial 
sample could be used to obtain a preliminary estimate of prevalence, and that estimate 
of prevalence used to estimate the necessary size of the second sample. After the 
second sample is collected, it is combined with the initial sample to obtain estimates 
for the population. A sequential sampling scheme generally needs to be accounted for 
in the estimation; in Cox’s method, for example, the sample variance obtained after 
combining the data from the initial and second samples is biased downward (Lohr, 
1990). Lai (2001) reviews history and uses of sequential methods. 
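
As a rough illustration of the two-stage idea, the Python sketch below uses the prevalence estimated from an initial sample to choose the size of the second sample so that the combined sample is expected to contain a target number of rare-population members. The function name and the numbers are hypothetical, and the sketch addresses only the sample size calculation, not the adjustment to estimation that a sequential design requires.

```python
import math

def second_sample_size(n_initial, rare_found, target):
    """Size of the second sample expected to bring the number of rare-population
    members up to `target`, using the prevalence estimated from the initial sample."""
    if rare_found == 0:
        raise ValueError("no rare units found; the initial sample gives no usable prevalence estimate")
    remaining = max(target - rare_found, 0)
    # remaining / p_hat, with p_hat = rare_found / n_initial
    return math.ceil(remaining * n_initial / rare_found)

# Hypothetical numbers: 10 rare units among 500 sampled, and 100 are wanted in total.
print(second_sample_size(n_initial=500, rare_found=10, target=100))
```

Because the preliminary prevalence estimate is itself variable, the number of rare-population members actually obtained may differ noticeably from the target.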

Adaptive cluster sampling (Thompson, 1990) assumes that the rare population 
is aggregated—caribou are in herds, an infectious disease is concentrated in regions 
of the country, or artifacts are clustered at specific sites of an archaeological dig. An 
initial probability sample of psus (often quadrats, in wildlife applications) is selected. 
For each psu in the initial sample, measure a response such as the number of caribou 
in the psu. If the number of caribou in psu i exceeds a predetermined value c, then 
add neighbors of psu i to the sample. Count the number of caribou in each of the 
neighboring units and add the neighbors of any of those units with more than c 
caribou to the sample. Continue the procedure until none of the neighbors has more 
than c caribou. The adaptive nature of the sampling scheme needs to be accounted for 
when estimating population quantities—if you estimate caribou density by (number 
of caribou observed)/(number of psus sampled) from an adaptive cluster sample, 
your estimate of caribou density will be far too high. Thompson and Seber (1996), 
Thompson and Collins (2002), and Turk and Borkowski (2005) describe various 
approaches for adaptive cluster sampling and give references to other work. 
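
The selection procedure can be sketched in Python as below. The grid of counts, the initial sample, the threshold $c$, and the choice of the four adjacent quadrats as "neighbors" are assumptions made for illustration. The code carries out only the selection step and then computes the naive density estimate that the text warns against; it does not implement the estimators in the references above.

```python
from collections import deque

def adaptive_cluster_sample(grid, initial, c):
    """Return the set of (row, col) psus sampled when neighbors of any sampled psu
    with more than c animals are added repeatedly."""
    rows, cols = len(grid), len(grid[0])
    sampled = set(initial)
    queue = deque(cell for cell in initial if grid[cell[0]][cell[1]] > c)
    while queue:
        r, k = queue.popleft()
        for dr, dk in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # four adjacent psus
            nb = (r + dr, k + dk)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and nb not in sampled:
                sampled.add(nb)
                if grid[nb[0]][nb[1]] > c:                  # keep expanding around dense psus
                    queue.append(nb)
    return sampled

grid = [                     # hypothetical caribou counts in a 5 x 5 grid of quadrats
    [0, 0, 0, 0, 0],
    [0, 5, 8, 0, 0],
    [0, 0, 6, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
sampled = adaptive_cluster_sample(grid, initial=[(1, 1), (4, 0)], c=2)
naive_density = sum(grid[r][k] for r, k in sampled) / len(sampled)
true_density = sum(map(sum, grid)) / 25
print(len(sampled), naive_density, true_density)
```

In this realization the initial sample happens to hit the herd, so the naive ratio of animals observed to psus sampled is well above the true density.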


14.2


Small Area Estimation


In most surveys, estimates are desired not only for the population as a whole, but also 
for subpopulations (domains). We discussed estimation in domains in Section 4.2 for 
SRSs and showed that estimating domain means is a special case of ratio estimation 
because the sample size in the domain varies from sample to sample. But we noted that 
if the sample size for the domain in an SRS was large enough, we could essentially 
act as though the sample size was fixed for inference about the domain mean. 

In complex surveys with many domains, estimation is not quite that simple. One 
worry is that the sample size for a given domain will be too small to provide a useful 
estimate. The NCVS, for example, gives reliable information on the incidence of 
different types of criminal victimizations in the United States as a whole. However, if 
you are interested in estimating the violent crime rate at the state level for the purpose 
of allocating federal funds for additional police officers, the sample sizes for some 
states are so small that direct estimates of the violent crime rates for those states 
are of very little use. You might conjecture, though, that crime rates are similar in 
neighboring states with similar characteristics, and use information from other states 
to improve the estimate of violent crime rate for the state with a small sample size. 
You could also incorporate information on crime rate from other sources, such as 
police statistics, to improve your estimate. 
Similarly, the National Assessment of Educational Progress (NAEP; see Example 11.10)
data collected on students in New York may be sufficient for estimating
eighth grade mathematics achievement for students in the state, but not for a direct 
assessment of mathematics achievement in individual cities such as Rochester. The 
survey data from Rochester, though, can be combined with estimates from other cities 
and with school administrative data (scores on other standardized tests, for example, 
or information about mathematics instruction in the schools) to produce an estimate 
of eighth grade mathematics achievement for Rochester that we hope has smaller 
mean squared error. 

Small area estimation techniques, in which estimates are obtained for domains 
with small sample sizes, have in recent years been the focus of intense research in 
statistics. Rao (2003) describes small area estimation methods and gives a bibliog- 
raphy for further reading. Here, we summarize some of the proposed approaches. 
Let $\delta_{id} = 1$ if observation unit $i$ is in domain $d$ and 0 otherwise. In this section,
the quantities of interest are the domain totals $t_d = \sum_{i=1}^{N} \delta_{id} y_i$, the domain sizes
$N_d = \sum_{i=1}^{N} \delta_{id}$, and the domain means $\bar{y}_{Ud} = t_d/N_d$, for domains $d = 1, \ldots, D$.


14.2.1 Direct Estimators


A direct estimator of $t_d$ depends only upon the sampled observations in domain $d$:

$$\hat{t}_d = \sum_{i \in S} w_i \delta_{id} y_i. \tag{14.2}$$

The estimated domain totals in (14.2) satisfy the following additive property: If
domains $d_1$ and $d_2$ are mutually exclusive, and if domain $d_3$ is the union of domains
$d_1$ and $d_2$, then

$$\hat{t}_{d_1} + \hat{t}_{d_2} = \hat{t}_{d_3}.$$

The additive property is desirable since we would like the estimated numbers of
people without health insurance in each demographic group to sum to the estimated
number of people without health insurance in the population.
The domain mean is estimated by

$$\hat{\bar{y}}_d = \frac{\sum_{i \in S} w_i \delta_{id} y_i}{\sum_{i \in S} w_i \delta_{id}}. \tag{14.3}$$

Because $\hat{\bar{y}}_d$ is a ratio, the variance is estimated using linearization (see Example 9.2)
as

$$\hat{V}(\hat{\bar{y}}_d) = \frac{1}{\hat{N}_d^2}\,\hat{V}\!\left[\sum_{i \in S} w_i \delta_{id}\,(y_i - \hat{\bar{y}}_d)\right], \tag{14.4}$$

where $\hat{N}_d = \sum_{i \in S} w_i \delta_{id}$ is the estimated domain size.


The approximation to the variance is valid if the expected sample size in the domain 
is sufficiently large. Section 11.3 discussed comparing domain means using 
regression. 
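
The point estimators (14.2) and (14.3), and the additive property, can be checked numerically with a short Python sketch. The sample below, a list of (weight, domain label, $y$) triples, is invented; a domain is represented as a set of labels so that a union of domains can be passed directly. Variance estimation is omitted because it depends on the full sampling design.

```python
def domain_total(sample, domain):
    """Direct estimator (14.2): weighted sum of y over sampled units in the domain."""
    return sum(w * y for w, d, y in sample if d in domain)

def domain_size(sample, domain):
    """Estimated domain size: sum of the weights of sampled units in the domain."""
    return sum(w for w, d, _y in sample if d in domain)

def domain_mean(sample, domain):
    """Direct estimator (14.3); undefined if no sampled unit falls in the domain."""
    return domain_total(sample, domain) / domain_size(sample, domain)

# Hypothetical sample: (sampling weight, domain label, response).
sample = [(10, "A", 3), (10, "A", 5), (20, "B", 2), (20, "B", 4), (15, "C", 7)]

t_a = domain_total(sample, {"A"})
t_b = domain_total(sample, {"B"})
t_ab = domain_total(sample, {"A", "B"})    # union of the two mutually exclusive domains
print(t_a, t_b, t_ab, domain_mean(sample, {"A"}), domain_mean(sample, {"B"}))
```

A domain with no sampled units makes the denominator of (14.3) zero, which is one reason the direct estimator can fail for small areas.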
Warning: In an SRS, if you create a new data set that consists solely of sampled
observations in domain d and then apply the standard variance formula, your variance 
estimator is approximately unbiased. Do not adopt this approach for estimating the 
variance of domain means in complex samples. A sampled psu may contain no obser- 
vations in domain d; if you eliminate such psus and then apply the standard variance 
formula, you may underestimate the variance (see Exercises 5 and 8). Survey soft- 
ware such as SAS PROC SURVEYMEANS calculates the variance correctly when 
you specify domains. 

In practice, the sample size in domain $d$ may be so small that the variance of $\hat{\bar{y}}_d$
is unacceptably large. Some domains of interest may have no observations at all so
that a direct estimator cannot be calculated. The next sections describe methods that
may be used to estimate domain means and totals in these cases.


14.2.2 Synthetic and Composite Estimators


Assume that we know some quantity associated with $t_d$ for each domain $d$. For
estimating violent crime victimization rates, we might use $t_{ud}$ = total amount of violent
crime in domain $d$ obtained from police reports. Then, if the ratios $t_d/t_{ud}$ are similar
in different domains, and if each ratio is similar to the ratio of population totals $t_y/t_u$,
then a simple form of synthetic estimator

$$\hat{t}_d(\text{syn}) = \frac{\hat{t}_y}{t_u}\, t_{ud}$$

may be more accurate than $\hat{t}_d$ in (14.2). Certainly the variance of $\hat{t}_d(\text{syn})$ will be
relatively small, since $\hat{t}_y/t_u$ is estimated from the entire sample and is expected to be
precise. If the ratios are not homogeneous, however (if, for example, the proportion
of violent crime victimizations reported to the police varies greatly from domain to
domain), the synthetic estimator may have large bias.

You can also use synthetic estimation in subsets of the population, and then
combine the synthetic estimators for each subset. For estimating violent crime
victimization in small areas, you could divide the population into different age-race-gender
classes. Then you could find a synthetic estimate of the total violent crime
victimization in domain $d$ for each age-race-gender class, and sum the estimates for the
age-race-gender classes to estimate the total violent crime victimizations in small area $d$. It
is hoped that the ratios (violent crime victimizations in domain $d$ for age-race-gender
class $c$ from NCVS)/(violent crime victimizations in domain $d$ for age-race-gender
class $c$ from police reports) are more homogeneous than the ratios $t_d/t_{ud}$.

The direct estimator is unbiased but may have large variance; the synthetic
estimator has smaller variance but may have large bias. They may be combined to form
a composite estimator:

$$\hat{t}_d(\text{comp}) = \alpha_d\, \hat{t}_d(\text{dir}) + (1 - \alpha_d)\, \hat{t}_d(\text{syn})$$

for $0 \le \alpha_d \le 1$. The optimal weights $\alpha_d$ depend on the relative variance and
bias of the direct and synthetic estimators, but one possible solution has $\alpha_d$ related to
the sample size in domain $d$. If few units are observed in domain $d$, $\alpha_d$ will be close
to zero and more reliance will be placed on the synthetic estimator.
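
The arithmetic of the synthetic and composite estimators is shown in the Python sketch below. All totals and the weight $\alpha_d = 0.25$ are invented for illustration; in practice $\alpha_d$ would be chosen from the variances and biases of the two estimators or from a model.

```python
def synthetic_total(t_y_hat, t_u, t_ud):
    """Synthetic estimator: apply the overall ratio t_y_hat / t_u to the domain's auxiliary total."""
    return t_y_hat / t_u * t_ud

def composite_total(direct, synthetic, alpha):
    """Composite estimator: weighted average of the direct and synthetic estimators."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * direct + (1 - alpha) * synthetic

# Hypothetical values: estimated victimizations overall (survey), police-report totals
# overall and in domain d, and a direct estimate for domain d from a small sample.
t_syn = synthetic_total(t_y_hat=50_000, t_u=20_000, t_ud=400)
t_comp = composite_total(direct=1300, synthetic=t_syn, alpha=0.25)
print(t_syn, t_comp)
```

With a small $\alpha_d$ the composite estimate stays close to the synthetic value, which is the behavior described above for domains with few sampled units.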


14.2.3 Model-Based Estimators


In a model-based approach, a superpopulation model is used to predict values in 
domain d. The model “borrows strength” from the data in related domains, or incor- 
porates auxiliary information from administrative data or other surveys. The models 
can often be used to determine the weights $\alpha_d$ in a composite estimator. Mixed models,
described in Section 11.5, are often used in small area estimation. 

The Fay–Herriot model (Fay and Herriot, 1979) is commonly used when a vector
of auxiliary information $\mathbf{x}_d$ is available for each domain $d = 1, \ldots, D$. We wish to
estimate the population domain mean $\theta_d = \bar{y}_{Ud}$. Let $\bar{y}_d$ be a direct estimator of $\theta_d$
from the survey with variance $V(\bar{y}_d) = \psi_d$. Assume that

$$\bar{y}_d = \theta_d + e_d,$$

where $e_d \sim N(0, \psi_d)$ and $e_1, \ldots, e_D$ are independent. Also assume that the population
domain means $\theta_d$ are related to the covariates $\mathbf{x}_d$ through the model

$$\theta_d = \mathbf{x}_d^T \boldsymbol{\beta} + v_d,$$

where $v_1, \ldots, v_D$ are independent $N(0, \sigma_v^2)$ random variables. Combining the two
models, we have

$$\bar{y}_d = \mathbf{x}_d^T \boldsymbol{\beta} + v_d + e_d, \tag{14.5}$$

which includes the error term $e_d$ from the direct estimator as well as the error $v_d$ from
the model that is assumed to hold for the population domain means. If $\psi_d$ and $\sigma_v^2$ are
known, then the best linear unbiased predictor of $\theta_d$ is

$$\tilde{\theta}_d = \alpha_d \bar{y}_d + (1 - \alpha_d)\, \mathbf{x}_d^T \tilde{\boldsymbol{\beta}},$$

where $\alpha_d = \sigma_v^2 / (\sigma_v^2 + \psi_d)$ and $\tilde{\boldsymbol{\beta}}$ is the weighted least squares estimator of $\boldsymbol{\beta}$. The
estimator $\tilde{\theta}_d$ thus depends more heavily on the direct estimator $\bar{y}_d$ when $\psi_d = V(\bar{y}_d)$ is
small; it depends more heavily on the predicted value from the regression model, $\mathbf{x}_d^T \tilde{\boldsymbol{\beta}}$,
when $\psi_d$ is large. In practice, $\sigma_v^2$ must be estimated from the data and the estimator
$\hat{\sigma}_v^2$ used to estimate $\alpha_d$ by $\hat{\alpha}_d$.
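The predictor above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the procedure used by any particular program: it assumes an intercept plus a single domain-level covariate, treats $\psi_d$ and $\sigma_v^2$ as known, and uses made-up numbers. The function name `fay_herriot_predict` and all inputs are hypothetical.

```python
def fay_herriot_predict(ybar, x, psi, sigma2_v):
    """Composite predictor alpha_d * ybar_d + (1 - alpha_d) * (b0 + b1 * x_d).

    Assumes one covariate plus an intercept, with psi_d and sigma2_v known.
    """
    # Under the combined model, V(ybar_d) = sigma2_v + psi_d, so the
    # weighted least squares weights are the reciprocals of these variances.
    w = [1.0 / (sigma2_v + p) for p in psi]
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_w = sum(wi * yi for wi, yi in zip(w, ybar)) / sw
    sxy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, ybar))
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = y_w - b1 * x_w
    # Shrinkage weights: near 1 when the direct estimator is precise.
    alpha = [sigma2_v / (sigma2_v + p) for p in psi]
    theta = [a * y + (1.0 - a) * (b0 + b1 * xi)
             for a, y, xi in zip(alpha, ybar, x)]
    return theta, alpha, (b0, b1)


if __name__ == "__main__":
    # Hypothetical direct estimates, covariate values, and sampling variances.
    ybar = [10.0, 12.0, 9.0, 15.0]
    x = [1.0, 2.0, 0.5, 3.0]
    psi = [0.5, 4.0, 1.0, 9.0]
    theta, alpha, beta = fay_herriot_predict(ybar, x, psi, sigma2_v=1.0)
    for d, (t, a) in enumerate(zip(theta, alpha), start=1):
        print(f"domain {d}: alpha = {a:.3f}, predictor = {t:.3f}")
```

Domains with a large sampling variance receive a small weight on the direct estimate and are pulled toward the regression prediction. In practice $\sigma_v^2$ would be estimated rather than supplied.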

The Fay—Herriot model is an example of an area-level model for small area esti- 
mation; it includes quantities that describe the domain as a whole. In the U.S. Small 
Area Income and Poverty Estimates program (www.census.gov/did/www/saipe), the 
estimated poverty rate for a county is a weighted average of the direct estimate from a 
survey (since 2006, the American Community Survey is used; before that, the Current 
Population Survey was used) and the predicted value from a regression equation using 
auxiliary information from tax records, food stamp programs, and other sources. Each 
covariate is at the domain level; the covariates include the number of food stamp par- 
ticipants in the county and the number of Internal Revenue Service exemptions on tax 
returns with adjusted gross income below the poverty threshold (Bell et al., 2007). In 
a county with a large sample size, $\tilde{\theta}_d$ is very close to the estimator from the survey.
Logarithms are taken of the variables so that their distribution is closer to a normal 
distribution. 

A unit-level model requires knowledge of covariate values for each individual in
the survey. In the NAEP, if $Y_{dj}$ is the mathematics achievement of student $j$ in domain
$d$ in the population, you might postulate a model such as

$$Y_{dj} = \beta_0 + x_{dj}\beta_1 + u_d + \varepsilon_{dj},$$

where $u_d \sim N(0, \sigma_u^2)$ and $\varepsilon_{dj} \sim N(0, \sigma^2)$ for $d = 1, \ldots, D$ and student $j$ in domain $d$,
assuming all random variables $u_d$ and $\varepsilon_{dj}$ are independent (Battese et al., 1988). The
student-level covariate $x_{dj}$ (we included one covariate for simplicity, but of course
several covariates could be included) could come from administrative records, for
example, the student’s score on an achievement test given to all students in the state
or the student’s grades in mathematics classes. Assume that the population mean of
$x$ in domain $d$, $\bar{x}_{Ud} = N_d^{-1} \sum_{j=1}^{N_d} x_{dj}$, is known. Then, if the domain sample size $n_d$ is
small relative to the population size $N_d$, it can be shown that the best linear unbiased
predictor of the modeled domain mean $\mu_d = \beta_0 + \bar{x}_{Ud}\beta_1 + u_d$ is

$$\tilde{\mu}_d = \tilde{\beta}_0 + \bar{x}_{Ud}\tilde{\beta}_1 + \gamma_d \left( \bar{y}_d - \tilde{\beta}_0 - \bar{x}_d \tilde{\beta}_1 \right),$$

where $\bar{y}_d$ and $\bar{x}_d$ are the sample means in domain $d$, $\gamma_d = \sigma_u^2 / (\sigma_u^2 + \sigma^2 / n_d)$, and
$\tilde{\beta}_0$ and $\tilde{\beta}_1$ are the best linear unbiased estimators of $\beta_0$ and $\beta_1$. The predictor
depends more on the direct estimator $\bar{y}_d$ if the within-domain variance of $\bar{y}_d$ under
the model, $\sigma^2 / n_d$, is small; otherwise, it depends more on the predicted value from
the regression at the population domain mean of $x$. Rao (2003) describes unit-level
models and other models commonly used in small area estimation.
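As a small numerical illustration of the unit-level predictor, the following Python sketch evaluates $\gamma_d$ and the predicted domain mean. It is only a sketch: the regression coefficients and variance components are passed in as if they were known, whereas in practice they would be estimated from the data, and the function name `unit_level_predict` and all numbers are hypothetical.

```python
def unit_level_predict(ybar_d, xbar_d, xbar_U, n_d, b0, b1, sigma2_u, sigma2):
    """Predicted domain mean under the unit-level model, with the
    coefficients (b0, b1) and variance components treated as given."""
    # Weight on the sample residual; grows toward 1 as n_d increases.
    gamma = sigma2_u / (sigma2_u + sigma2 / n_d)
    # Regression prediction at the known population mean of x in the domain.
    synthetic = b0 + b1 * xbar_U
    # Average residual of the sampled units in the domain.
    residual = ybar_d - b0 - b1 * xbar_d
    return synthetic + gamma * residual, gamma


if __name__ == "__main__":
    # Hypothetical inputs for one domain with four sampled students.
    mu_tilde, gamma = unit_level_predict(
        ybar_d=9.0, xbar_d=12.0, xbar_U=10.0, n_d=4,
        b0=2.0, b1=0.5, sigma2_u=1.0, sigma2=4.0,
    )
    print(f"gamma = {gamma:.2f}, predicted domain mean = {mu_tilde:.2f}")
```

With a larger domain sample size and the other inputs unchanged, $\gamma_d$ moves toward 1 and the prediction relies more on the domain's own data.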

An indirect estimator, whether synthetic, composite, or model-based, is essentially 
an exercise in predicting missing data. Indirect estimators are thus highly dependent 
on the model used to predict the missing data—the synthetic estimator, for example, 
assumes that the ratios are homogeneous across domains. When possible, the model 
assumptions should be checked empirically; one method for exploring validity of
the model assumptions is to pretend that some of the data you have is actually not
available, and to compare the indirect estimator with the direct estimator computed
with all the data.


Chapter Summary


Rare populations present special challenges for sampling, since many standard sam- 
pling designs yield few units in the rare population. Several designs discussed in 
previous chapters can be used to increase the number of rare population units in the 
sample. Auxiliary information associated with the rare characteristic can be used to 
design a stratified sample with disproportional allocation. If such auxiliary informa- 
tion is not known in advance, a two-phase sampling design can collect inexpensive 
screening information in the phase I sample, and then collect the detailed survey 
information in phase II. 

Multiple frame surveys, in which independent probability samples are selected 
from sampling frames whose union is assumed to include the entire rare population, 
can greatly reduce the cost of a survey of a rare population. One frame might cover the 
entire population, while other frames might be incomplete yet inexpensive to sample. 
By sampling from multiple frames, you achieve the cost savings from sampling lists 
of rare population members, while also having complete population coverage by 
sampling from a complete frame.

Network, snowball, and adaptive cluster samples use connections among population
members to increase the efficiency of the sampling design. In network sampling,
persons in a probability sample are asked about themselves as well as persons defined
to be in their network, for example, their adult siblings. A snowball sample often
begins with a convenience sample of persons in the rare population, who are then
asked to provide contact information for other persons in the rare population. In
adaptive cluster sampling, the responses of an initial probability sample are used to
select neighboring units for inclusion.

Small area estimation methods rely on auxiliary information and models to obtain
estimators of population quantities in domains in which the sample size is too small
for a direct estimator to be reliable.


Key Terms


Adaptive cluster sampling: A sequential sampling design in which estimates from
the first units selected for the sample are used to determine inclusion probabilities for
subsequent units.


Multiple frame survey: A survey in which independent samples are taken from two
or more sampling frames that are thought to include the whole population.


Network sampling: A sampling method in which a probability sample is taken from
a population and each sampled unit provides information on itself and on units in its
network.


Rare population: A subpopulation that is uncommon relative to the whole
population.


Small area: A subpopulation for which the sample size is small.


For Further Reading


Kalton and Anderson (1986), Sudman et al. (1988), Kalton (2003), and Christman
(2009) review methods for sampling rare populations. The book edited by Thompson
(2004) describes methods for sampling rare species in wildlife applications.
Rao (2003) reviews methods for small area estimation, with applications to estimating
poverty, unemployment, disease prevalence, and census undercounts.


14.4 Exercises


A. Introductory Exercises


What designs would you consider for sampling each of the following rare populations? 
a Alumni of your university who are currently working as engineers. 

b Persons who are caregivers for a household member who has Alzheimer’s.

c Households with children aged 18-36 months.

d Muslims in Canada; the 2001 Canadian Census estimated that there were approximately
600,000 Muslims in Canada in 2001 (about 2% of the population).


e Businesses that emit benzene. Some types of businesses—for example, gas 
stations—are thought to be likely to emit benzene. For other businesses, the ben- 
zene emissions are unknown, but it is thought that if one business is found to 
emit benzene, it is likely that other businesses in the same industry and area emit 
benzene as well. 


C. Working with Theory 


(Requires calculus.) Kalton and Anderson (1986) consider disproportional stratified
random sampling for estimating the mean of a characteristic $y_i$ in a rare population.
Let $r_i = 1$ if person $i$ is in the rare population and 0 otherwise. Stratum 1 contains
$N_1$ persons, $M_1$ of whom are in the rare population; stratum 2 contains $N_2$ persons,
with $M_2$ persons in the rare population. We wish to estimate the population mean
$\bar{y}_{Ud} = \sum_{i \in \mathcal{U}} r_i y_i / (M_1 + M_2)$, where the sum is over all persons in the population,
using a stratified random sample of $n_1$ persons in stratum 1 and $n_2$ persons in stratum 2.

a Suppose $A = M_1 / (M_1 + M_2)$ is known. Let $\hat{\bar{y}}_d = A\bar{y}_1 + (1 - A)\bar{y}_2$, where $\bar{y}_1$ and
$\bar{y}_2$ are the sample means of the rare population members in strata 1 and 2, respectively.
Show that, if you ignore the finite population corrections (fpcs) and if the
sampled number of persons in the rare population in each stratum is sufficiently
large, then

$$V(\hat{\bar{y}}_d) \approx \frac{A^2 S_1^2}{n_1 p_1} + \frac{(1 - A)^2 S_2^2}{n_2 p_2},$$

where $S_j^2$ is the variance of $y$ for the rare population members in stratum $j$ and
$p_j = M_j / N_j$ for $j = 1, 2$.

b Suppose that $S_1^2 = S_2^2$ and that the cost to sample each member of the population
is the same. Let $f_2 = n_2 / N_2$ be the sampling fraction in stratum 2, and write
the sampling fraction in stratum 1 as $f_1 = k f_2$. Show that the variance in (a) is
minimized for a fixed sample size $n$ when $k = \sqrt{p_1 / p_2}$.


(Requires calculus.) Consider the dual frame survey in Figure 14.1(b) in which independent
probability samples are taken from frames A and B. Suppose that all three
domains are nonempty. Let $S^A$ denote the sample from frame A, with inclusion probabilities
$\pi_i^A = P(i \in S^A)$ and sampling weights $w_i^A = 1/\pi_i^A$. Corresponding quantities
for frame B are $S^B$, $\pi_i^B$, and $w_i^B$. Let $\delta_i = 1$ if unit $i$ is in domain $ab$ and 0 otherwise.
Then $\hat{t}_a^A = \sum_{i \in S^A} w_i^A (1 - \delta_i) y_i$ and $\hat{t}_b^B = \sum_{i \in S^B} w_i^B (1 - \delta_i) y_i$ estimate the domain
totals $t_a$ and $t_b$, respectively. There are two independent estimators of the population
total in the intersection domain $ab$: $\hat{t}_{ab}^A = \sum_{i \in S^A} w_i^A \delta_i y_i$ and $\hat{t}_{ab}^B = \sum_{i \in S^B} w_i^B \delta_i y_i$.

a Let $\theta \in [0, 1]$. Show that

$$\hat{t}_{y,\theta} = \hat{t}_a^A + \theta\, \hat{t}_{ab}^A + (1 - \theta)\, \hat{t}_{ab}^B + \hat{t}_b^B$$

is an unbiased estimator of $t_y = \sum_{i=1}^{N} y_i$ with

$$V(\hat{t}_{y,\theta}) = V\left[\hat{t}_a^A + \theta\, \hat{t}_{ab}^A\right] + V\left[(1 - \theta)\, \hat{t}_{ab}^B + \hat{t}_b^B\right].$$

b Show that $V(\hat{t}_{y,\theta})$ is minimized when

$$\theta = \frac{V(\hat{t}_{ab}^B) + \mathrm{Cov}(\hat{t}_b^B, \hat{t}_{ab}^B) - \mathrm{Cov}(\hat{t}_a^A, \hat{t}_{ab}^A)}{V(\hat{t}_{ab}^A) + V(\hat{t}_{ab}^B)}.$$


The estimator in Exercise 22 of Chapter 6 for indirect sampling can be applied to
network sampling (Lavallée, 2007) to give the estimator in Sirken (1970). In the
context of network sampling, $\mathcal{U}^A$ is the sampling frame population for the initial
sample and $\mathcal{U}^B$ is the population of interest, with $M$ elements. The links $\ell_{ik}$ define
the networks: $\ell_{ik} = 1$ if person $k$ in $\mathcal{U}^B$ is in the network of unit $i$ in $\mathcal{U}^A$. Thus,
$L_k = \sum_{i \in \mathcal{U}^A} \ell_{ik}$ is the multiplicity for person $k$.

a Show that the estimator in Equation (14.1) equals the estimator $\hat{t}_y$ given in Exercise 22(a)
of Chapter 6. Consequently, the estimator in (14.1) is an unbiased estimator of $t_y$.

b Suppose that $\mathcal{U}^A = \mathcal{U}^B$ is a population of $N$ persons, and the sample from $\mathcal{U}^A$,
$S^A$, is an SRS of size $n$. Let $y_k = 1$ if person $k$ has the rare characteristic and 0
otherwise. Find the variance of the estimator in (14.1).

c How does the variance in (b) compare with the variance of the estimator $\sum_{i \in S^A} w_i y_i$, which
uses only information from $S^A$?


Consider a stratified sample in which an SRS of $n_h$ psus is selected from the population
of $N_h$ psus in stratum $h$, for $h = 1, \ldots, H$. We wish to estimate the mean of domain $d$.

a Find $\hat{V}(\bar{y}_d)$ using linearization.

b Now suppose that a data analyst creates a new data set by deleting observations
that are not in domain $d$. If you (incorrectly) act as though this is the full data set,
what is the estimated variance of $\bar{y}_d$?

c Show that the estimators of the variance in (a) and (b) are unequal if some sampled
psus have no observations in domain $d$. The correct variance estimator is given in
(a) and (14.4).


Estevao and Särndal (1999) and Hidiroglou and Patak (2004) study the use of auxiliary
information in domain estimation, which can reduce the variance of the direct
domain estimator $\hat{t}_d$ in Section 14.2.1.

a If the population total for an auxiliary variable $x$, $t_x$, is known, we may use the
ratio estimator

$$\hat{t}_{d,r1} = \hat{t}_d\, \frac{t_x}{\hat{t}_x}.$$

If the sample size in domain $d$ is sufficiently large to use linearization, what is
$V(\hat{t}_{d,r1})$? Does $\hat{t}_{d,r1}$ have the additive property?

b If we know the population total of $x$ for each domain $d$, with $t_{xd}$ the sum of $x_i$
over all population units in domain $d$, then we can use a domain-specific ratio estimator

$$\hat{t}_{d,r2} = \hat{t}_d\, \frac{t_{xd}}{\hat{t}_{xd}}.$$

What is $V(\hat{t}_{d,r2})$? Does $\hat{t}_{d,r2}$ have the additive property?


(Requires calculus.) Consider the Fay–Herriot model in (14.5). Suppose that $\psi_d$, $\sigma_v^2$,
and $\boldsymbol{\beta}$ are known.

a Let

$$\tilde{\theta}_d(a) = a\, \bar{y}_d + (1 - a)\, \mathbf{x}_d^T \boldsymbol{\beta}$$

with $a \in [0, 1]$. Show that, under the model in (14.5), $E_M[\tilde{\theta}_d(a) - \theta_d] = 0$ for any
$a \in [0, 1]$.

b Show that $V_M[\tilde{\theta}_d(a) - \theta_d]$ is minimized when $a = \alpha_d$ and that $V_M[\tilde{\theta}_d(\alpha_d) - \theta_d] = \alpha_d \psi_d$.
Consequently, under the model, $V_M[\tilde{\theta}_d(\alpha_d) - \theta_d] \le V_M[\bar{y}_d - \theta_d]$.


D. Projects and Activities 


Construct a population with 20 strata. Each stratum has 8 psus and each psu has
3 secondary sampling units (ssus), so that the population has a total of 480 ssus.
Observation $j$ of psu $i$ in stratum $h$ has $y_{hij} = h$, for $h = 1, \ldots, 20$, so that all
observations in stratum 1 have the value 1, all observations in stratum 2 have the
value 2, and so on. Within each stratum, all observations in psus 1-4 are in domain
1, and all observations in psus 5-8 are in domain 2.


a Select a one-stage stratified sample from the population by selecting an SRS of
two psus from each stratum and including all ssus within the selected psus in the
sample. Your sample should have 120 observations. Estimate the population mean
for each domain along with its standard error.


b Repeat (a) for a second stratified sample, selected independently (i.e., use a
different random seed). Compare the domain means from this sample with those
from (a). Do the domain means vary from sample to sample?


c Now create a new data set for your sample in (a) that consists only of observations
in domain 1, by deleting all the observations in domain 2. What is the estimated
domain mean from this data set? What is the standard error using this data set,
and why is it incorrect?
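The setup for this project can be sketched in a few lines of code. The book's calculations use SAS; the Python sketch below is a swapped-in illustration that only builds the population and draws the stratified sample, leaving the estimation and standard errors to the reader. The function names, the dictionary layout, and the seed are assumptions made for the example.

```python
import random


def build_population(n_strata=20, psus_per_stratum=8, ssus_per_psu=3):
    """One record per ssu, with y equal to the stratum number."""
    population = []
    for h in range(1, n_strata + 1):
        for i in range(1, psus_per_stratum + 1):
            domain = 1 if i <= 4 else 2  # psus 1-4: domain 1; psus 5-8: domain 2
            for j in range(1, ssus_per_psu + 1):
                population.append(
                    {"stratum": h, "psu": i, "ssu": j, "domain": domain, "y": h}
                )
    return population


def select_sample(population, psus_sampled=2, seed=None):
    """SRS of psus within each stratum, keeping every ssu in a chosen psu."""
    rng = random.Random(seed)
    sample = []
    for h in sorted({u["stratum"] for u in population}):
        psus = sorted({u["psu"] for u in population if u["stratum"] == h})
        chosen = set(rng.sample(psus, psus_sampled))
        weight = len(psus) / psus_sampled  # sampling weight N_h / n_h
        for u in population:
            if u["stratum"] == h and u["psu"] in chosen:
                sample.append({**u, "weight": weight})
    return sample


if __name__ == "__main__":
    pop = build_population()
    smp = select_sample(pop, seed=12345)  # use a different seed for part (b)
    print(len(pop), len(smp))
```

Each sampled record keeps its stratum and psu identifiers, which is the information a survey analysis routine needs to compute standard errors that respect the design.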


Forest data. Cells with primary cover-type cottonwood/willow form a rare popu- 
lation in the forest data. What methods discussed in this chapter might be used to 
sample the cottonwood/willow cells? Which do you think will be most efficient? 


Activity for course project. Are there rare populations of interest for the survey you 
studied in Exercise 31 of Chapter 7? If so, what design features were used in the 
survey to sample members of the rare population? 


Survey Quality


Dudley. Do you think you've learned from your mistakes?
Peter. Oh, yes, I've learned from my mistakes and I’m sure I could repeat them exactly.


– Peter Cook and Dudley Moore, Good Evening: A Comedy-revue in Two Acts


EXAMPLE 15.1


The American Community Survey (U.S. Census Bureau, 2005) is the largest continuing
sample survey in the history of the United States. Each year, approximately
3 million questionnaires are mailed out to households across the United States. Of
course, a survey of this scale requires a great deal of planning and development, and
potential inaccuracies in the data need to be resolved before the survey is launched.
For national estimates of quantities such as unemployment and household size, the
sampling error of the survey will be very small. But other sources of error such as
nonresponse and undercoverage are important. It is thus crucial in planning such a
survey that all errors be considered in the design. ■


Throughout this book, we have concentrated on designing surveys that will
produce accurate and timely statistics. Chapters 2–7 discussed survey designs that
could be used to control sampling error for estimating population means and totals.
Chapters 9–14 outlined other methods for analyzing data from complex surveys.
Chapters 1 and 8 discussed nonsampling errors that can arise in surveys.

In Chapters 2-7 we assumed that there are no nonsampling errors, and the 
only reason that survey estimates differ from population quantities is that a sam- 
ple was taken rather than a census. In many surveys, the margin of error reported 
is based entirely on the sampling error; nonsampling errors are sometimes acknowl- 
edged in the text, but generally are not included in the reported measures of 
uncertainty. 

Dalenius (1977, p. 21) referred to the practice of reporting only sampling error 
and ignoring other sources of error as “‘strain at a gnat and swallow a camel’; this
characterization applies especially to the practice with respect to the accuracy: the 
sampling error plays the role of the gnat, sometimes malformed, while the non- 
sampling error plays the role of the camel, often of unknown size and always of 
unwieldy shape.” 


In this chapter, we explore approaches to survey design and analysis that consider
the whole camel. Much of the early inspiration for this approach came from
W. Edwards Deming, who in addition to writing one of the first books on survey sam- 
pling (Deming, 1950) was also one of the leaders in developing quality improvement 
methods for industry after World War II (Boardman, 1994). Not surprisingly, Deming 
was one of the earliest writers to consider factors that might affect the quality of 
survey estimates. Deming (1944) discussed survey errors due to interviewer variabil- 
ity, survey mode, questionnaire design, sampling variability, nonresponse, and other 
sources now considered to be part of total survey error. 

Quality in surveys draws on many ideas from Deming’s work on quality improve- 
ment (Deming, 1986). Biemer and Lyberg (2003) define survey quality as “fitness 
for use.” While somewhat vague, this definition recognizes the multiple purposes of 
survey data. Eurostat (2000) considers quality to encompass seven dimensions: 


1 Relevance of statistical concept. The statistics collected must meet user needs. 


2 Accuracy of estimates. Estimates should be close to the true values of population 
quantities. 


3 Timeliness. Results need to be disseminated quickly to be useful. Indeed, as 
argued in Chapter 1, one reason for taking a survey rather than conducting a 
census is that the survey can be completed much more rapidly. 


4 Accessibility and clarity of information. Particularly in official statistics, data and 
data products must be accessible to users, and sufficient documentation should be 
provided to enable users to interpret the results. 


5 Comparability. Many surveys such as the National Crime Victimization Survey 
(NCVS) have a purpose of comparing estimates over time; such surveys must be 
conducted so that these comparisons are meaningful. When survey results are to be 
compared for different countries, care must be taken to ensure the concepts being 
measured are interpreted the same way in different countries and that appropriate 
methodologies are used to ensure comparable results (Harkness et al., 2003). 


6 Coherence. Common definitions and standards should be used when data come 
from several sources. 


7 Completeness. The data collector should be able to provide statistics for all 
domains identified by the community of data users. 


While all of these quality dimensions may be important in different contexts, we 
argue that data accuracy is the most important aspect of data quality. Timely, coherent, 
comparable statistics are of little use if they are wildly inaccurate. As defined in 
Chapter 2, an estimator $\hat{\theta}$ of a population quantity $\theta$ is accurate if it is close to the
true value of the quantity being estimated, that is, if $\mathrm{MSE}[\hat{\theta}] = E[(\hat{\theta} - \theta)^2]$ is small.
We can consider the total survey error (Andersen et al., 1979) to be the sum of five 
main sources of error: 


total survey error = coverage error + nonresponse error + measurement error 
+ processing error + sampling error. 


Lessler and Kalsbeek (1992) and Linacre and Trewin (1993) emphasize the concept 
of total survey design: You should design a survey to reduce errors in general, not 
just sampling errors. Of course, to design a survey to minimize all the error, you need
to know what the major error components are. If you know that most of the error in
survey estimates is caused by coverage problems, then you can devote resources to 
improving the coverage. If you know that the coding is highly accurate, then you do not 
need to devote as many resources to improving the quality of the coding procedures. 

Total survey design calls for an interdisciplinary approach. The areas of expertise 
needed to study and reduce sources of error include statistical theory of complex 
surveys, design of experiments, statistical process control, mixed models, cognitive 
psychology, management, and ethnography. 


15.1 Coverage Error


As discussed in Chapter 1, coverage is the percentage of the population of interest
that is included in the sampling frame. A mismatch between the target population
and the sampling frame can cause coverage bias. Most common is undercoverage,
where the sampling frame misses part of the population. If the target population mean
for all $N$ units in the population is $\bar{y}_U$, let $\bar{y}_{UF}$ be the mean for the $N_F$ units in the
sampling frame, and $\bar{y}_{UN}$ be the mean for the $N - N_F$ units in the target population
but not in the sampling frame. The bias due to undercoverage is then

$$\bar{y}_{UF} - \bar{y}_U = \frac{N - N_F}{N} \left( \bar{y}_{UF} - \bar{y}_{UN} \right). \tag{15.1}$$

The bias is thus low if (1) the population means are approximately the same for the
covered and noncovered units in the population, that is, $\bar{y}_{UF} \approx \bar{y}_{UN}$, or (2) the coverage
rate, $N_F / N$, is high.
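The undercoverage bias in (15.1) is simple to evaluate when the quantities on the right-hand side are available. The Python sketch below uses made-up numbers; the function name `coverage_bias` and the population values are hypothetical, chosen only to show how the two factors combine.

```python
def coverage_bias(mean_frame, mean_not_covered, n_frame, n_total):
    """Undercoverage bias from (15.1): the noncoverage rate times the
    difference between the covered and noncovered means."""
    noncoverage_rate = (n_total - n_frame) / n_total
    return noncoverage_rate * (mean_frame - mean_not_covered)


if __name__ == "__main__":
    # Hypothetical population: 1,000 units, 900 of them in the frame.
    # The proportion with some characteristic is 0.30 in the frame
    # and 0.10 among the 100 units the frame misses.
    bias = coverage_bias(mean_frame=0.30, mean_not_covered=0.10,
                         n_frame=900, n_total=1000)
    overall_mean = (900 * 0.30 + 100 * 0.10) / 1000
    print(f"bias from (15.1):       {bias:.4f}")
    print(f"frame mean minus truth: {0.30 - overall_mean:.4f}")
```

The two printed quantities agree, since (15.1) is an identity. Setting the two means equal, or setting `n_frame` equal to `n_total`, drives the bias to zero, which matches conditions (1) and (2) in the text.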


15.1.1 Measuring Coverage and Coverage Bias


Estimating undercoverage or bias caused by undercoverage is, in general, difficult. 
If it were easy to identify and reach units missed by the sampling frame, those units 
would have been included when the frame was constructed. By definition, undercov- 
erage is external to the survey and thus information external to the survey must be 
used to assess it. 

You can sometimes tell if there has been undercoverage or nonresponse by com- 
paring survey estimates of demographic characteristics with known values of those 
characteristics for the population. If your estimated number of 18- to 24-year-old 
males from the survey is much lower than the total number of 18- to 24-year-old 
males from a census, then there is likely undercoverage of that subpopulation. But 
demographic counts that match do not necessarily mean that you have full coverage. 
A sampling frame for an e-mail survey may have equal numbers of men and women, 
but it will lack both men and women without e-mail addresses. 

The coverage rate can sometimes be estimated using information from other stud- 
ies or external records. For example, undercoverage in a survey of households with 
infants might be assessed by taking a sample of recent birth certificates and checking 
whether the households are in the sampling frame. The U.S. Census Bureau (2002b) 
reports that about 70% of eligible voters were registered to vote in November 2000,
so a survey using voter registration lists as a sampling frame for eligible voters would
have an estimated coverage of 70%. 

Election polls present many challenges for constructing sampling frames and 
assessing coverage. The target population for a pre-election poll is persons who will 
vote in the election, but no one knows in advance exactly who those persons will 
be. A sampling frame of registered voters will include many persons who do not 
vote on election day. A sampling frame of persons who voted in the last election 
will miss new voters. Many pre-election polls in the United States use models to 
predict who is likely to vote, taking into account voting history and other information. 
Most current polls are conducted by telephone and thus do not include nontelephone 
households. 


The U.S. Census is intended to count every person in the United States; in a sense, its 
mission is to obtain complete and accurate coverage of the country. It is thus essential 
that the coverage be assessed. In the 1980, 1990, and 2000 censuses, postenumera- 
tion surveys were used to estimate the degree of undercoverage and duplicate records. 
Some of the methods used for these surveys were described in Example 13.2. 

Mulry (2004) describes components of error in the 2000 census and in the surveys 
used to evaluate its coverage. The census itself has undercoverage (from households 
that are not contacted), nonresponse (from households that do not return their form, 
or that omit persons living in the household from the form), and duplicate records 
(from persons listed twice, for example, a college student counted at his residence 
in the college city and also listed on his parents’ form). The evaluation survey also 
has undercoverage and nonresponse. Persons who move between census day and the 
postenumeration survey are at different residences for the two surveys; while persons 
who move into the survey areas are counted correctly, the persons who move out 
of the survey areas must be estimated. Both census and the postenumeration survey 
have measurement error, since some persons report an erroneous residence for census 
day. Matching records from the census and postenumeration survey is also subject to
error. ■


Network or snowball sampling (see Sections 14.1.5 and 14.1.6) can sometimes be 
used to estimate coverage rate, particularly in surveys of rare populations. Rothbart 
et al. (1982) found that a network sample of Vietnam veterans gave improved coverage 
of Vietnam veterans from minority groups. 

The methods discussed so far involve estimating the coverage rate, $N_F / N$, in
(15.1). The second factor of the undercoverage bias in (15.1), $\bar{y}_{UF} - \bar{y}_{UN}$, depends on
the mean value for units not in the sampling frame. Estimating $\bar{y}_{UN}$ requires data from
the uncovered part of the population, which in general must be obtained from an exter- 
nal source. Large government surveys can sometimes be used to estimate coverage 
bias on responses related to responses of interest. The American Community Survey, 
for example, includes telephone and nontelephone households. It could thus be used 
to estimate the bias for some responses from an educational survey that is conducted 
by telephone. If the telephone and nontelephone households in the American Com- 
munity Survey have significantly different proportions of college graduates, then you 
would expect the educational survey that excludes nontelephone households to have 
bias for estimating the proportion of college graduates and related items.


15.1.2 Coverage and Survey Mode


The mode of survey administration (in-person, telephone, mail, e-mail, fax, Internet, 
and so on) exerts great influence on the coverage properties, and choice of mode 
should be influenced in part by the coverage that can be obtained. Dillman et al. 
(2009) provide an excellent discussion of coverage issues in sample surveys. Other 
considerations for choice of mode, such as response rate and accuracy of responses 
for various modes, are discussed in Chapter 7. 

Area frames usually have the highest coverage. An area frame is constructed by 
selecting a sample of geographical primary sampling units (psus) from the region 
of interest. Field investigators construct a list of housing units for the psus selected 
for the sample, and a probability sample of housing units is selected from the list 
in each psu. Not surprisingly, area frames are also generally the most expensive 
to construct and sample. The nontelephone households in the sample often require 
in-person interviews. 

The sampling frame for a mail or e-mail survey is a list of physical or e-mail 
addresses. The coverage of the frame depends on the completeness and accuracy of 
the list. Even if the frame contains everyone in the target population, the addresses may 
be wrong because persons may have moved or changed their e-mail addresses. E-mail 
surveys often work well for surveys in a university or organization in which everyone 
uses e-mail; for other populations, they exclude persons who do not have e-mail or 
who never check their e-mail accounts. Mail and e-mail surveys carry risks that the 
questionnaire will not reach the intended respondent. A mail survey might be discarded 
by another household member; an e-mail survey might be deleted by a spam filter. 

Telephone surveys may use list frames constructed from directories or random 
digit dialing (see Example 6.12). List frames for telephone surveys, like lists of 
addresses for mail or e-mail surveys, may be incomplete or have incorrect telephone 
numbers. Persons who move frequently are less likely to appear in the directory. At 
this writing, most U.S. telephone surveys include only households with landline tele- 
phones. Estimating the percentage of households who do not have a landline telephone 
is challenging. Tucker et al. (2007) used data from a Current Population Survey sup- 
plement to estimate that in 2004, 6% of U.S. households had a cellular telephone only 
and an additional 5.4% of households had no telephone service.¹ The households with
only a cellular telephone, however, differed from those with landline service: They 
were more likely to be renters rather than owners of the housing unit and more likely to 
be one-person households. Adults aged 15-24 and unmarried adults were more likely 
to be in cellular-only households than their older or married counterparts. There may 
be additional nonresponse in a random digit dialing (RDD) survey if households with 
landlines and cellular telephones primarily use their cellular phones. 

Internet surveys are appealing because of their low cost, but obtaining good cov- 
erage of the target population is challenging. At this writing, the most reliable surveys 
that use the Internet to collect data select the sample from a mail, telephone, or area 
frame. They contact the sampled individuals through another mode and then ask them 
to submit survey responses through the Internet. Some survey organizations provide
computers and Internet access for persons in the sample who do not have them. These
surveys thus include members of the population who do not have Internet access.

¹The proportion of households with cellular telephone service has increased since 2004; the National
Center for Health Statistics reports updated estimates at www.cdc.gov/nchs.

Unfortunately, careful Internet surveys that collect a probability sample are rare. 
It is difficult to assess coverage in other Internet surveys (Couper, 2000). Internet sur- 
veys in which website visitors volunteer to participate are untrustworthy and should 
not be used to estimate characteristics of a population. The coverage of such surveys 
is unknown because the sample consists of volunteers. Some of the volunteers may 
take the survey many times in an attempt to influence the results. Even if the sam- 
ple matches the target population on demographic characteristics, or is weighted to 
match demographic characteristics through poststratification, it is likely that other 
characteristics will differ. 


In April 2007 the city of Tempe, Arizona, arranged for a market research company 
to conduct a survey of Tempe residents to solicit opinions about a proposed neigh- 
borhood shuttle bus service. An announcement of the upcoming survey and a map of 
the proposed bus route were mailed to every address in the neighborhood in March
2007. The market research company took a telephone survey of approximately 700
Tempe residents. Because the survey was done by telephone, the questionnaire started 
with screening questions to ensure the respondent lived in the area of interest. The 
respondent was then asked to refer to a map that had been mailed earlier and to answer 
questions about proposed bus routes shown on the map. Neighborhood residents who 
were not selected to be in the telephone sample were given the opportunity to respond 
to the survey over the Internet. 

Telephone was a poor choice for the survey mode. The telephone survey required 
screening questions to exclude persons not in the neighborhood. Cell phones were 
not sampled, resulting in undercoverage in this neighborhood close to Arizona State 
University. The survey required respondents to refer to a map that had been mailed 
earlier, and it is likely that many respondents would not have ready access to that map 
when they were called, resulting in measurement error. In addition, city planners and 
neighborhood activists were interested in whether residents adjacent to the proposed 
bus route had different opinions than other residents. The telephone survey could not 
guarantee a sufficient sample size of residents along the route. 

A mail survey would have been a much better choice. The city had already gone 
to the expense of mailing the map to all neighborhood residents. It could have easily 
selected a stratified sample (stratified by proximity to proposed route, with a higher 
sampling fraction for addresses close to the proposed route) of those addresses and 
included the survey in the envelopes mailed to the households in the stratified sam- 
ple. The money saved by not taking a telephone sample could have been used for 
nonresponse follow-up. ■
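
The stratified mail design suggested in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame, the stratum labels `near_route` and `other`, the stratum sizes, and the sampling fractions are all hypothetical and are not values from the Tempe survey.

```python
import random

def stratified_sample(frame, fractions, rng):
    """Draw an SRS without replacement within each stratum.

    frame:     dict mapping stratum label -> list of addresses (hypothetical)
    fractions: dict mapping stratum label -> sampling fraction in (0, 1]
    Returns a dict mapping stratum label -> list of (address, weight) pairs,
    where weight = N_h / n_h is the sampling weight for stratum h.
    """
    sample = {}
    for stratum, units in frame.items():
        n_h = max(1, round(fractions[stratum] * len(units)))
        chosen = rng.sample(units, n_h)
        weight = len(units) / n_h
        sample[stratum] = [(unit, weight) for unit in chosen]
    return sample

# Hypothetical frame: 200 addresses near the proposed route, 800 elsewhere.
frame = {
    "near_route": [f"near-{k}" for k in range(200)],
    "other": [f"other-{k}" for k in range(800)],
}
# Higher sampling fraction close to the route (illustrative values only).
fractions = {"near_route": 0.5, "other": 0.1}
sample = stratified_sample(frame, fractions, random.Random(2007))
```

Within each stratum the weights sum to the stratum size, so estimates for residents along the route and for other residents can be computed separately or combined.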


15.1.3 Improving Coverage


As with nonresponse, the best way to deal with undercoverage and overcoverage is 
to prevent it. Some options for improving coverage in the survey design are: 


• Check the sampling frame to remove duplicates.


• Compare the sampling frame with external sources to check for members of the
target population that are missing in the frame.


• Choose a survey mode or modes that have high coverage of the target population.


• Use a multiple frame survey. An area frame, though expensive to sample, often
has good coverage of the population. Data from the area frame sample can then 
be combined with data from incomplete frames that are inexpensive to sample, 
as discussed in Section 14.1.4. Coverage can often be improved by combining 
samples from several incomplete frames; even if the union of the frames does 
not include the entire target population, the multiple frame survey will have better 
coverage than any of the frames taken singly. A dual frame survey with one sample 
from a frame of landline telephones and another sample from a frame of cellular 
telephones will miss nontelephone households, but will have better coverage than 
a sample of landline telephones that excludes persons who use a cellular telephone 
exclusively. 


Poststratification, discussed in Section 8.5.2, can partially alleviate coverage bias, 
but, as with all after-the-fact adjustments for nonresponse or coverage errors, you do 
not know whether the adjustment truly compensates for coverage bias unless you 
obtain data on the persons not covered by the sampling frame. 


15.2 Nonresponse Error


In Chapter 8, we looked at possible remedies for nonresponse that has already 
occurred. It is far better, of course, to be able to prevent or reduce nonresponse before
it occurs. The methods outlined in Section 8.2 can be used to reduce nonresponse at 
the survey design stage. 

We recommend using the AAPOR (2008b) standards for reporting nonresponse 
rates. As with undercoverage, it is often challenging to assess the bias due to non- 
response. In some cases, you can obtain accurate data for nonrespondents from an 
external source such as a population register and use the external records to evalu- 
ate the bias due to nonresponse. In a health survey, you might be able to access the 
medical records of a subsample of nonrespondents and a subsample of respondents. 
You can then compare the respondents and nonrespondents on quantities given in the 
medical records. If the nonrespondents have significantly higher blood pressure than 
the respondents, they may differ from the respondents on key survey items as well. 

Similar comparisons can be done if your sampling frame has substantial auxiliary 
information about each individual in the frame. A university administrator taking a 
survey of students can compare the grade point averages and majors of survey respon- 
dents with those of the survey nonrespondents. In addition to identifying potential 
nonresponse bias, the frame information can be used to construct weighting classes 
or impute values for nonrespondents. 
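
A minimal sketch of the comparison described above, assuming equal sampling weights and a frame that records each sampled student's major and grade point average. The records and the function names `mean_gpa` and `weighting_class_factors` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical frame records for sampled students: (major, GPA, responded?)
sampled = [
    ("STEM", 3.6, True), ("STEM", 2.8, False),
    ("STEM", 3.4, True), ("STEM", 2.9, False),
    ("Arts", 3.1, True), ("Arts", 3.3, True),
    ("Arts", 3.0, True), ("Arts", 2.7, False),
]

def mean_gpa(records, responded):
    """Mean frame GPA among respondents (True) or nonrespondents (False)."""
    values = [gpa for _, gpa, r in records if r == responded]
    return sum(values) / len(values)

def weighting_class_factors(records):
    """Weight adjustment per class: (number sampled) / (number responding).

    Assumes equal sampling weights; classes with no respondents are skipped
    and would need to be collapsed with another class in practice.
    """
    n_sampled = defaultdict(int)
    n_resp = defaultdict(int)
    for major, _, r in records:
        n_sampled[major] += 1
        n_resp[major] += int(r)
    return {m: n_sampled[m] / n_resp[m] for m in n_sampled if n_resp[m] > 0}

# A large gap in a frame variable hints at possible nonresponse bias.
gap = mean_gpa(sampled, True) - mean_gpa(sampled, False)
factors = weighting_class_factors(sampled)
```

With unequal sampling weights, the counts would be replaced by sums of weights within each class.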

Comparing survey estimates of demographic quantities to those of a census or 
large survey such as the American Community Survey can also indicate nonresponse 
problems. This can identify undercoverage or nonresponse for different demographic 
groups, and suggest weighting variables that might be useful for nonresponse adjust- 
ment. When comparing results with another survey, be careful that the same def- 
initions and measurements are used. Your survey may have different estimates of 
unemployment than the American Community Survey because you define unemployment
differently, or because your survey covers a different time period.

You can also compare persons who respond initially with those who respond after 
several attempts to reach them. You might speculate that nonrespondents are similar 
to persons who are reached only after great effort, and use the information from the 
callbacks to estimate the nonresponse bias. This is a big assumption, though, and not 
always well founded (Lin and Scheaffer, 1995). 

Under a response propensity framework, nonresponse bias depends on the relation 
between the (unknown) response propensity $\phi_i$ of each unit and the variable of interest
$y_i$ (see Exercise 16 of Chapter 8). If $\phi_i$ and $y_i$ are uncorrelated, then the bias incurred
by estimating the population mean using the respondents only is approximately zero. 
A model adopted for nonresponse that estimates the response propensities accurately 
will also reduce the nonresponse bias. Unfortunately, we do not know whether the 
propensity to respond is correlated with the responses or whether the model for non- 
response is good because we have no data on the nonrespondents. We can, however, 
fit several models for nonresponse and investigate the sensitivity of the results to the 
modeling assumptions, as described in Little and Rubin (2002). 
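
The dependence of the bias on the relation between response propensity and the variable of interest can be illustrated numerically. The sketch below uses a commonly cited approximation, which is an assumption here rather than a formula quoted from this section: the bias of the respondent mean is roughly the population covariance of the propensities and the $y$ values divided by the mean propensity. The population values and function names are hypothetical.

```python
def approximate_bias(y, phi):
    """Approximate bias of the respondent mean under a propensity model.

    Uses bias ~ (1 / (N * phi_bar)) * sum_i (phi_i - phi_bar) * (y_i - y_bar),
    a standard approximation assumed for this illustration.
    """
    n_pop = len(y)
    y_bar = sum(y) / n_pop
    phi_bar = sum(phi) / n_pop
    cov = sum((p - phi_bar) * (v - y_bar) for p, v in zip(phi, y)) / n_pop
    return cov / phi_bar

def propensity_weighted_mean(y, phi):
    """Ratio of expected respondent total to expected respondent count."""
    return sum(p * v for p, v in zip(phi, y)) / sum(phi)

y = [10.0, 20.0, 30.0, 40.0]        # hypothetical population values
phi_flat = [0.5, 0.5, 0.5, 0.5]     # propensity unrelated to y
phi_rising = [0.2, 0.4, 0.6, 0.8]   # propensity increases with y

bias_flat = approximate_bias(y, phi_flat)
bias_rising = approximate_bias(y, phi_rising)
```

When every unit has the same propensity the approximate bias is zero; when units with larger $y$ values are more likely to respond, the respondent mean is pulled upward.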

Groves (2006) concludes from a review of 30 empirical studies that nonresponse 
bias occurs but is not necessarily correlated with the nonresponse rate. Some studies 
have relatively high response rates and yet still have high bias, while other studies with 
lower response rates have low bias. In general, higher response rates are better and 
complete response is best of all. Paradoxically, though, sometimes efforts to increase 
response rates can also increase nonresponse bias. This occurs, for example, when 
the measures taken to increase response rates also increase the correlation between $\phi_i$
and $y_i$. An incentive given in a survey might increase the propensity of low-income
persons to respond and thus result in more bias for estimates of income. 

Much research (see Groves and Couper, 1998, and Groves et al., 2002, Chapters 
1-17) has been done on why persons choose to respond to a survey and how surveys 
can be designed to increase cooperation. Cialdini (1984) identifies factors associated 
with willingness to respond to a survey: 


1 Reciprocation. Will the potential respondent gain something by participating in 
the survey? The 2005 Census Test in Maricopa County advertised that your partic- 
ipation helps your community by making sure everyone is counted. Informational 
booklets describing how the survey data will be used can motivate some per- 
sons to respond. Incentives can increase survey cooperation in some instances; 
Singer (2002) reviews experiments on the effectiveness of incentives for increas- 
ing response rates and concludes that monetary incentives are most effective in 
surveys for which persons have few other motivations to participate. 


2 Authority. Persons are often more likely to provide responses to a survey if it is 
issued by a recognized authority. University faculty members may be more likely 
to respond to a survey sent by the university president than a survey distributed 
by a graduate student. The U.S. Census Bureau (Griffin et al., 2003) sent one 
group of potential respondents a “mandatory” letter saying that participation in 
the survey is “required by law. We are conducting this survey under the authority 
of Title 13, United States Code, sections 141-193, and 221.” Another group was 
sent a “voluntary” letter saying “Your participation in the survey is important; 
however, you may decline to answer any or all questions.” The mail response rate
was more than 20 percentage points higher with the “mandatory” letter than with 
the “voluntary” letter. 


3 Consistency. Once someone is persuaded to participate in a survey, that person is 
likely to continue and perhaps participate in other surveys. 


4 Scarcity. The scarcity heuristic is related to reciprocation: A potential respondent 
who believes that the opportunity to participate in the survey is reserved for the 
select few may be more likely to respond. 


5 Social validation. Potential respondents may be more likely to participate if they 
believe others do so. 


6 Liking. Potential respondents may be more amenable to participation if they like 
the interviewer. 


Persons choose to participate in a survey for many different reasons, so a flex- 
ible approach to soliciting responses is helpful. Different survey introductions may 
work better with some subsets of the population. Skilled interviewers use a vari- 
ety of approaches to persuade persons to respond to a survey. We know that some 
nonresponse will occur despite the best efforts of the survey designer. Thus, it is 
valuable to have additional information in the sampling frame—not just for adjusting 
the estimates for nonresponse after the data are collected, but for giving interviewers 
additional information to use when recruiting respondents. 

As mentioned in Chapter 8, different survey modes tend to have different response 
rates. Hox and deLeeuw (1994) found, in their review of studies comparing response 
rates, that in-person surveys typically obtain the highest response rates, telephone 
surveys the second highest response rates, and mail surveys the lowest. Tourangeau 
et al. (2000) report more item nonresponse in self-administered questionnaires than in 
questionnaires administered by interviewers. Some surveys have obtained increased 
response rates by offering potential respondents a choice of response mode. 

Several modes may be used for nonresponse follow-up. The American 
Community Survey conducts interviews using three modes. The initial surveys are 
sent out by mail; this is the least expensive form of data collection at present. The 
following month, households that did not respond to the mail survey are contacted 
by telephone. In the third month, in-person interviews are conducted with a 
subsample of households that did not respond to the mail survey or the telephone 
survey. Households who respond by different modes may have different characteris- 
tics, however. For example, households that respond by mail may be more likely to 
own their homes and have a household head who is white non-Hispanic (Citro et al., 
2004, pp. 101-102). 


15.3 Measurement Error


In Chapters 2-7 we assumed that $y_i$, a characteristic of interest on unit $i$, is a fixed
quantity measured without error. When there is measurement error, however, $y_i$ is not
the true characteristic of interest for unit $i$. Instead, there is some underlying value $\mu_i$,
and $y_i$ is a measurement of $\mu_i$ taken from the survey. For example, suppose that the
characteristic of interest is $\mu_i$ = the true amount that household $i$ spent on medical
care between March and June of last year. The response provided by the household,
$y_i$, is not necessarily equal to $\mu_i$. The question may be worded confusingly so that the
respondent omits some medical expenses (perhaps omitting over-the-counter medications);
the respondent may forget some expenses or include expenses from July;
characteristics of the interviewer may lead some respondents to give inaccurate answers
(perhaps excluding expenses they are embarrassed by); or other circumstances may
lead to $y_i$ differing from $\mu_i$. Measurement error is the difference between the response
$y_i$ provided by a survey respondent and the true value of the response, $\mu_i$. Estimating
measurement error, like coverage and nonresponse bias, requires additional information
and modeling.

Often, as for measuring the amount spent on medical care between March and 
June, $\mu_i$ is a fixed value that could be found exactly using specific definitions of “medical
care” and “household.” Demographic characteristics such as age or ethnicity, physical
measurements such as body mass index, behavioral variables such as number of
visits to doctors, and monetary variables can be thought of as having a true underlying
value $\mu_i$ that could be determined if the measuring instruments were precise enough.
In other cases, the true characteristic of interest may not have a precise physical mean- 
ing, as when a consumer confidence survey asks you whether you think you will be 
better off, worse off, or about the same financially a year from now. Although it would 
be possible to compare your financial status 12 months from now with your financial 
status now, that is not the point of the survey—the survey researchers want to know 
how optimistic or confident you are about your short-term financial future. Psychome- 
tricians call a possibly unobservable underlying characteristic, in this case consumer 
confidence, a construct, and attempt to approximate the construct through items that 
can be measured. It is rarely possible for survey questions to correspond exactly to cer- 
tain underlying constructs, however, which is the reason for the advice in Section 1.5 
to report the actual questions asked when summarizing results from a survey. 

The survey instrument, the interviewer, and the respondent all can contribute to 
measurement error. To reduce measurement error due to the survey instrument, follow 
Bradburn’s (2004) Law for Questionnaires: “Ask what you want to know, not 
something else.” Bradburn’s Law, while eminently sensible, can be challenging to 
implement, and Bradburn reviews recent research by linguists, psychologists, and
statisticians on reducing measurement error when constructing a questionnaire. Presser 
et al. (2004) describe methods that can be used to test and evaluate questionnaires. 

Interviewers can contribute to response variability and bias, and interviewer effect 
varies with different modes of data collection. Interviewers can often increase response 
rates and improve accuracy by explaining questions to respondents. But some respon- 
dents may give a more socially desirable response to a survey conducted by an inter- 
viewer than to a self-administered survey, and may report that they exercise more, and 
gamble less, than they actually do. Some interviewers may prompt a respondent toward 
a particular response. Extreme interviewer effects may occur when interviewers falsify 
the data by changing responses or fabricating entire interviews. The American Asso- 
ciation of Public Opinion Research website (www.aapor.org/pdfs/falsification.pdf) 
provides guidelines on detecting and minimizing interviewer falsification. 

Respondents may deliberately or inadvertently provide inaccurate information 
to a survey. Respondents to the NCVS may forget about criminal victimizations that 
have occurred to them or, if reporting for another person, be unaware of that person’s
experiences with criminal victimization. A respondent may choose not to report an 
incident of domestic violence to the survey, particularly if the perpetrator is present 
during the interview. 


15.3.1 Measuring and Modeling Measurement Error


Biemer and Stokes (1991) review models that have been proposed for measurement 
error. Suppose that $T$ replications of the measurement of unit $i$ could be taken, and let
$y_{it}$ be the value of the $t$th replicate measurement on unit $i$. A simple additive model
for the measurement error is


$$y_{it} = \mu_i + \beta_i + \varepsilon_{it}, \tag{15.2}$$


where $\beta_i$ is a fixed bias for respondent $i$ and $\varepsilon_{it}$ is a random variable representing
unexplained sources of measurement error. In the simplest model, the $\varepsilon_{it}$'s are
assumed to be independent random variables with mean 0 and variance $\sigma_i^2$. Define
$\bar{\mu} = \sum_{i=1}^{N} \mu_i / N$ and $V(\mu) = (N-1)^{-1} \sum_{i=1}^{N} (\mu_i - \bar{\mu})^2$. The assumptions of this
model imply that all conditions remain the same for replicate measurements, and that 
there are no carryover effects for multiple responses of the same person. 

If $\mu_i$ is the true characteristic of interest, the survey measurement $y_{it}$ should be
as close to $\mu_i$ as possible. For the model in (15.2), $\beta_i$ and $\sigma_i^2$ should both be close
to zero. In psychometrics, two concepts called validity and reliability are used to
assess this closeness. Validity deals with the correlation between a survey item and
the true score $\mu_i$. Many types of validity have been proposed (Groves, 1989); we define
theoretical validity to be the correlation between the true score and its observed value:
theoretical validity = $\mathrm{Corr}(y_i, \mu_i)$.

Sometimes you can find the true value $\mu_i$ by checking external records. In a survey
asking the question “Did you vote in the election on September 13, 2005?” you may
be able to check the voting records to determine whether the person actually voted
(of course, you must have an accurate way to link persons from the survey to the
voting records for this to work). Then $\mu_i = 1$ if the person is listed as voting in the
voting records and 0 otherwise; $y_i = 1$ if the person responds that he or she voted.
The validity of the question is estimated by the correlation between $y_i$ and $\mu_i$. If
there is no external source of the true value, you can sometimes estimate validity by 
other methods such as looking at the correlations among answers to closely related 
questions. 

Note that validity is not the same thing as unbiasedness or accuracy. Suppose $\mu_i$
is the weight of person $i$, and the scale has negligible variability but erroneously adds
5 kg to every measurement. Then $\mathrm{Corr}(y_i, \mu_i) \approx 1$ but $E[y_i \mid \mu_i] = \mu_i + 5$; in an SRS,
$\bar{y}$ will overestimate the true mean weight of the population. In general, you need an
external source of information to be able to evaluate measurement bias. 

Reliability deals with variability of responses under repeated measurements. If 
all the values of $\sigma_i^2$ are equal to $\sigma^2$ in the model in (15.2),


$$\text{Reliability} = \frac{V(\mu)}{\sigma^2 + V(\mu)} = \frac{\text{variance of true values}}{\text{variance of values reported to the survey}}. \tag{15.3}$$

If the reliability is 1, then $\sigma^2 = 0$, that is, respondent $i$ gives exactly the same answer
over repeated trials. If the answers of respondent i are highly variable over repeated 
trials, then the reliability is low. 
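
The distinction between validity, bias, and reliability can be checked with a small simulation of the additive model (15.2). This is a sketch under assumptions that are not from the text: every respondent has the same bias of 5 kg, the error variance is the same for all respondents, errors are normally distributed, and the population of true weights is hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def measure(mu, bias, sigma, rng):
    """One replicate of y_it = mu_i + beta_i + eps_it with constant beta_i."""
    return [m + bias + rng.gauss(0.0, sigma) for m in mu]

rng = random.Random(15)
n_pop = 5000
mu = [rng.gauss(70.0, 12.0) for _ in range(n_pop)]  # hypothetical true weights (kg)
sigma = 4.0                                         # assumed common error SD
y1 = measure(mu, 5.0, sigma, rng)                   # first replicate
y2 = measure(mu, 5.0, sigma, rng)                   # second replicate

mu_bar = sum(mu) / n_pop
v_mu = sum((m - mu_bar) ** 2 for m in mu) / (n_pop - 1)

reliability = v_mu / (sigma ** 2 + v_mu)  # formula (15.3)
test_retest = pearson(y1, y2)             # empirical counterpart of reliability
validity = pearson(y1, mu)                # theoretical validity
mean_bias = sum(y1) / n_pop - mu_bar      # close to the constant 5 kg bias
```

In this setup the correlation between two replicates approximates the reliability in (15.3), the validity stays high, and the mean is still off by about 5 kg, which neither correlation reveals.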

Cronbach’s alpha (Cronbach, 1951) is often used to estimate reliability when 
multiple questions are used to assess the same construct: 


$$\alpha = \frac{k\bar{r}}{1 + (k-1)\bar{r}},$$


where $k$ is the number of items, and $\bar{r}$ is the average of the pairwise correlations of
the items. If $\alpha$ is close to one, then there is high reliability. High reliability can occur,
however, when the questionnaire is constructed so that answers to one question affect 
answers to another. It is possible for all questions to be highly consistent yet for none 
of them to measure the true construct of interest. 
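
As an illustration, the form of Cronbach's alpha given above (based on the average pairwise correlation) can be computed as follows. The function names and the item responses are hypothetical; each inner list holds one item's answers across the same respondents.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def alpha_from_rbar(k, r_bar):
    """alpha = k * r_bar / (1 + (k - 1) * r_bar)."""
    return k * r_bar / (1 + (k - 1) * r_bar)

def standardized_alpha(items):
    """Alpha from a list of k item-response columns (k >= 2)."""
    pairs = list(combinations(items, 2))
    r_bar = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    return alpha_from_rbar(len(items), r_bar)

# Hypothetical answers of six respondents to three related items.
items = [
    [4, 5, 2, 3, 5, 1],
    [3, 5, 2, 4, 4, 2],
    [4, 4, 1, 3, 5, 2],
]
alpha = standardized_alpha(items)
```

For a fixed average correlation, alpha grows with the number of items: with $\bar{r} = 0.5$, three items give $\alpha = 0.75$, so a high value alone does not show that the items measure the intended construct.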

Hansen et al. (1961) and Kish (1962) proposed methods for studying errors due 
to interviewers. Kish (1962) proposed considering the interviewers to be randomly 
selected from a population of possible interviewers, so that a random or mixed effects 
model (see Section 11.5) would be reasonable for examining the effect of interviewer 
variability on the overall variability of estimators. He noted that the measurement 
error component cannot be distinguished from the sampling error unless replicate 
measurements are taken from the respondents. 

We can add an interviewer term to the basic measurement error model in (15.2). 
Let $y_{ijt}$ be the response given by respondent $i$ to interviewer $j$ on replicate $t$:


$$y_{ijt} = \mu_i + \beta_i + b_j + \varepsilon_{ijt}, \tag{15.4}$$


where $b_j$ is the systematic effect of interviewer $j$. We assume that $E_M(b_j) = 0$,
$V_M(b_j) = \sigma_b^2$, $E_M(\varepsilon_{ijt}) = 0$, $V_M(\varepsilon_{ijt}) = \sigma_i^2$, and that all of the $b_j$'s and $\varepsilon_{ijt}$'s are uncorrelated.
The model assumes that any respondent asked a question by interviewer $j$ is
likely to deviate from the true value by an amount $b_j$ that is intrinsic to interviewer $j$.
For example, in a health survey, perhaps Fred has a tendency to take blood pressure 
readings just a little below the true value. Then every person examined by Fred will 
have a blood pressure reading that is slightly too low, and respondents examined by 
Fred will tend to be more similar to each other than respondents selected at random. 
Or, in a victimization survey, respondents may tend to find an interviewer more sym- 
pathetic and be more likely to tell him or her about victimizations. That interviewer 
would tend to have more reported victimizations than other interviewers. The vari- 
ability due to interviewers can be estimated using standard methods for mixed models 
(Demidenko, 2004). 

The model in (15.4) can be expanded by including interaction effects between 
interviewers and respondents; for example, it might be thought that female respon- 
dents will report a different number of criminal victimizations to a female interviewer 
than to a male interviewer. Terms for mode effects can be added in a mixed-mode 
survey. 

Mahalanobis (1946) proposed interpenetrating subsampling for estimating 
interviewer effects. The basic idea is the same as for estimating the variance of 
systematic sampling (Section 5.5): Assign each interviewer a random subsample 
of the interviews. Often in surveys, interviewers are assigned according to convenience;
for example, an interviewer might be assigned to all households in a
psu, which confounds the effect of the psu with the effect of the interviewer. In 
interpenetrating subsampling with an SRS, interviewers are assigned households at 
random. 
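
A minimal sketch of interpenetrating subsampling with an SRS, followed by a simple estimate of the interviewer variance component in model (15.4). The estimator here is the classical balanced one-way ANOVA (method-of-moments) estimator, swapped in for illustration rather than taken from the text; a full mixed-model fit would be used in practice. The simulated blood pressure values, interviewer effects, and function names are hypothetical.

```python
import random

def assign_interviewers(household_ids, n_interviewers, rng):
    """Randomly split households into equal-sized interviewer workloads.

    Households left over when the count is not divisible are dropped
    in this simplified sketch.
    """
    ids = list(household_ids)
    rng.shuffle(ids)
    m = len(ids) // n_interviewers
    return [ids[j * m:(j + 1) * m] for j in range(n_interviewers)]

def interviewer_variance(groups):
    """ANOVA estimate of the between-interviewer variance (balanced groups)."""
    n_groups = len(groups)
    m = len(groups[0])
    means = [sum(g) / m for g in groups]
    grand = sum(means) / n_groups
    msb = m * sum((mean - grand) ** 2 for mean in means) / (n_groups - 1)
    msw = sum((v - mean) ** 2
              for g, mean in zip(groups, means) for v in g) / (n_groups * (m - 1))
    return max(0.0, (msb - msw) / m)  # truncate negative estimates at zero

rng = random.Random(1946)
true_bp = {h: rng.gauss(120.0, 15.0) for h in range(2000)}  # hypothetical mu_i
workloads = assign_interviewers(true_bp, 40, rng)

responses = []
for households in workloads:
    b_j = rng.gauss(0.0, 3.0)  # simulated interviewer effect, variance 9
    responses.append([true_bp[h] + b_j + rng.gauss(0.0, 5.0) for h in households])

sigma_b2_hat = interviewer_variance(responses)
```

Because households are assigned at random, differences among interviewer means beyond what within-interviewer variability explains can be attributed to the interviewers rather than to the areas they happened to cover.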


15.3.2 Reducing Measurement Error


The first step in reducing measurement error is to estimate its prevalence and iden- 
tify the main sources. If the largest component of measurement error is interviewer 
variability, then more standardized interview procedures may reduce the variance 
component. If respondents misinterpret questions, then better questions should be 
written and tested. 

We recommend collecting data using randomized experiments to estimate compo- 
nents of variability and likely sources of bias. Hartley and Rao (1978) and Hartley and 
Biemer (1981) give designs that can be used to estimate interviewer variability from 
surveys. Scott and Davis (2001) consider interviewer effects for binary data. Random- 
ized experiments, conducted before a survey is implemented, can compare versions 
of questionnaires, alternative field procedures, methods of interviewer training, and 
almost any other factor affecting survey quality. 

Fowler (1991) provides advice for reducing interviewer-related measurement 
error: 


• Test your questions. Interview potential respondents to see if they interpret the
questions as you intend.


• Write clear questions. If a respondent does not know how to answer a question, the
interviewer is likely to have more influence on the response. In a self-administered
survey, unclear questions can lead to more variability or bias in the responses.
Open-ended questions may be more susceptible to interviewer effects than closed
questions.


• Write procedures for administering the survey that will reduce errors.


• Hire good interviewers.


• Provide training and supervision for interviewers so they act consistently. Interviewers
should read the questions exactly as written, and should not indicate that
one response is preferred over another. An interviewer should have a professional
and neutral demeanor.


• Give interviewers a reasonable workload. Deming (1986) argues that assigning
numerical quotas to workers decreases quality: An industrial worker who is
required to make 130 parts per hour cannot pay attention to the quality of the part.
Cannell et al. (1977) found similar effects for survey interviewers: Interviewers
with high assignments had more errors in responses.


• Apply quality improvement principles to the interviewing process. Montgomery
(2008) describes quality improvement methods for many settings. Reducing measurement
error in surveys fits nicely into this framework.


Sensitive Questions
Nonresponse and Measurement Error


Many surveys involve questions that persons might view as sensitive. The American 
Community Survey asks a respondent to report his or her income in the past 12 
months from eight different possible sources of income (wages, alimony and child 
support, interest income, and other sources). The National Household Survey on Drug 
Abuse asks respondents about their use of marijuana and cocaine. Some respondents 
view such questions as intrusive; others may fear that providing accurate informa- 
tion may expose them to penalties (for example, they may fear that reporting their 
true income on a survey may lead to penalties for underpayment of income taxes). 
Some persons may protect their personal information by refusing to respond to the 
survey or to specific items, while others may give inaccurate answers to sensitive 
questions. 

Reputable survey takers promise respondents that their answers will be kept con- 
fidential. Respondents to the American Community Survey are assured that “Your 
data are confidential under Title 13, United States Code, Sections 9 and 214. Title 13 
specifies that the Census Bureau can use the information provided by individuals for 
statistical purposes only and cannot publish or release information that would identify 
any individual. Instead, data are released as profiles of groups of individuals within 
broad geographic areas” (U.S. Census Bureau, 2003a). Singer (2003) reports that 
persons who said they were concerned about confidentiality were less likely to return 
their Census forms by mail, although other factors such as age had a higher association 
with nonresponse than did confidentiality concerns. The American Statistical Asso- 
ciation Privacy, Confidentiality, and Data Security website (www.amstat.org, under 
Committees) provides links to numerous resources on assuring and protecting confi- 
dentiality of data. Even if promises of confidentiality do not influence the response 
rate, they are an ethical obligation to the respondents. 

There is much evidence that many people simply do not provide accurate answers 
to sensitive questions. Tourangeau et al. (2000, Chapter 9) summarize studies in 
which record checks indicate underreporting of certain behaviors. Urine samples 
often contradict persons who say they do not use illegal drugs; counts of abortions 
from abortion clinics far exceed estimates of the total number of abortions from 
surveys. There is also overreporting of behaviors that many deem socially desirable: 
Studies comparing self-reports of voting with actual voting records show that many 
people say they voted when they actually did not (Presser, 1990). 

As with coverage, mode of administration can have a great effect on responses 
to sensitive questions (see Tourangeau and Smith, 1996, and Kreuter et al., 2008). 
Many studies report that higher percentages of people say they have used illegal 
drugs when they fill out the questionnaire themselves than when the questionnaire 
is administered by an interviewer (Tourangeau et al., 2000, p. 295). Some surveys 
on sensitive topics use computer-assisted self-administration, where the respondent 
types answers directly onto the computer. The questions are displayed on-screen and 
are also played through a recording. An interviewer may be in the room to answer 
questions, but the interviewer does not see the responses typed into the computer
(Newman et al., 2002). 


15.4.2 Randomized Response


Sometimes you want to conduct a survey asking very sensitive questions, such as 
“Do you use cocaine?”, “Have you ever shoplifted?”, or “Did you understate
your income on your tax return?” 

These are all questions that “yes” respondents could be expected to lie about. 
A question form that encourages truthful answers but makes people comfortable is 
desired. Horvitz et al. (1967), in a variation of Warner’s (1965) original idea, suggested
using two questions: the sensitive question and an innocuous question. A randomizing 
device (such as a coin flip) determines which question the respondent should answer. 
If a coin flip is used as the randomizing device, the respondent might be instructed 
to answer the question “Did you use cocaine in the past week?” if the coin is heads, 
and “Is the second hand on your watch between 0 and 30?” if the coin is tails. The
interviewer does not know whether the coin was heads or tails, and hence does not 
know which question is being answered. It is hoped that the randomization, and the 
knowledge that the interviewer does not know which question is being answered, will 
encourage respondents to tell the truth if they have used cocaine in the past week. 

The randomizing device can be anything, but it must have known probability P 
that the person is asked the sensitive question and probability 1 — P that the person 
is asked the innocuous question. Other forms of randomized response are described 
in Fox and Tracy (1986). The key to randomized response is that the probability that 
the person responds yes to the innocuous question, $p_I$, is known. We want to estimate
$p_S$, the proportion responding yes to the sensitive question. If everyone answers the
questions truthfully, then


$$\begin{aligned}
\phi &= P(\text{respondent replies yes}) \\
&= P(\text{yes} \mid \text{asked sensitive question})\,P(\text{asked sensitive question}) \\
&\quad + P(\text{yes} \mid \text{asked innocuous question})\,P(\text{asked innocuous question}) \\
&= p_S P + p_I (1 - P).
\end{aligned}$$


Let $\hat{\phi}$ be the estimated proportion of “yesses” from the sample. Since both $P$ and $p_I$
are known, $p_S$ may be estimated by


$$\hat{p}_S = \frac{\hat{\phi} - (1 - P)\,p_I}{P}. \tag{15.5}$$

Then the estimated variance of $\hat{p}_S$ is

$$\hat{V}(\hat{p}_S) = \frac{\hat{V}(\hat{\phi})}{P^2}.$$


The penalty for randomized response appears in the factor $1/P^2$ in the estimated variance.
If $P = 1/3$, for example, the variance is nine times as great as it would have been
had everyone in the sample been asked the sensitive question and responded truthfully.
The larger $P$ is, the smaller the variance of $\hat{p}_S$. But if $P$ is too large, respondents
may think that the interviewer will know which question is being answered. Some
respondents may think that only a $P = 0.5$ is “fair” and that no other probabilities
exist when choosing between two items.


An SRS of high school seniors is selected. Each senior in the sample is presented with 
a card containing the following two questions: 


Question 1: Have you ever cheated on an exam?
Question 2: Were you born in July?


We know from birth records that \(p_I = 0.085\). Suppose the randomizing device
is a spinner, with P = 1/5. Of the 800 people surveyed, 175 say yes to whichever
question the spinner indicated they should answer. Then \(\hat{\phi} = 175/800\). Because this
is an SRS,
\[
\hat{V}(\hat{\phi}) = \frac{\hat{\phi}(1 - \hat{\phi})}{n - 1} = 0.0002139.
\]
Thus,
\[
\hat{p}_S = \frac{175/800 - (4/5)(0.085)}{1/5} = 0.75375,
\]
and \(\hat{V}(\hat{p}_S) = (0.0002139)/(1/5)^2 = 0.0053\). ■
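
The arithmetic in this example can be reproduced with a few lines of code. The following Python sketch is only an illustration: the function name and argument names are hypothetical, and it assumes an SRS with the finite population correction ignored, as in the example above.

```python
def randomized_response_estimate(yes_count, n, p_ask_sensitive, p_innocuous_yes):
    """Estimate p_S and its estimated variance for the two-question design.

    yes_count        number of "yes" answers in the sample
    n                sample size (SRS assumed, no finite population correction)
    p_ask_sensitive  P, probability the sensitive question is asked
    p_innocuous_yes  p_I, known probability of "yes" to the innocuous question
    """
    phi_hat = yes_count / n
    # Estimated variance of the sample proportion phi_hat under an SRS
    v_phi_hat = phi_hat * (1 - phi_hat) / (n - 1)
    # Estimator (15.5) and its estimated variance
    p_s_hat = (phi_hat - (1 - p_ask_sensitive) * p_innocuous_yes) / p_ask_sensitive
    v_p_s_hat = v_phi_hat / p_ask_sensitive ** 2
    return p_s_hat, v_p_s_hat


# Values from the exam-cheating example: 175 of 800 say yes, P = 1/5, p_I = 0.085
p_s_hat, v_p_s_hat = randomized_response_estimate(175, 800, 1 / 5, 0.085)
print(round(p_s_hat, 5), round(v_p_s_hat, 4))
```

Changing `p_ask_sensitive` in the call shows how quickly the variance grows as P decreases.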


Before using randomized response methods in your survey, you should test the 
method to see if the extra complication does indeed increase compliance and reduce 
bias. Danermark and Swensson (1987) found that randomized response methods 
worked well for estimating drug use in schools and appeared to reduce response bias. 
Duffy and Waterton (1988), however, concluded that randomized response meth- 
ods were not helpful in their survey to estimate incidence of various alcohol-related 
problems in Edinburgh, Scotland. They compared response rates and responses for 
a randomized response group with those for a group asked the questions directly, 
and found that the randomized response group had a lower response rate and lower 
estimated proportion of persons who had drunk more than the legal limit immediately 
before driving a car. Randomized response did, however, increase the complexity of 
the interviews, and interviewers reported that many persons were confused by the 
method. 


Processing Error 


Data entry error occurs when an answer given by a respondent differs from that entered
into the database. Before computer-assisted interviewing became common, a frequent 
source of errors was a clerk typing the responses into the database. Statistical process 
control methods can be used to reduce errors due to data transfer and coding. Mudryk 
et al. (2001) describe the methods used to monitor and reduce errors from character 
recognition software used to capture the data from scanned survey questionnaires in 
the Canadian Census of Agriculture. 

Coding errors can occur when responses are recorded. Open-ended questions are
particularly prone to coding error, since someone must make a decision about how to 
classify a response. A person, when asked about why he or she patronizes a certain 
restaurant, might say that it is because the restaurant has good food and is cheap—two 
different responses. The coder must then decide what response to enter for the person. 

Data editing can also introduce errors. Most survey organizations edit data files 
to remove internal inconsistencies and correct obvious errors (an individual with 
age 103 listed as living with his or her parents probably represents a coding error). 
Some organizations also impute values for missing data. Public-use data files are 
often edited to protect confidentiality of respondents’ data—observations may be 
swapped from one location to another or some responses may be modified (see Doyle 
et al., 2001, for methods used to protect confidentiality). Editing, in general, removes 
errors introduced in other stages of data collection. Over-editing, however, can intro- 
duce additional errors. Granquist and Kovar (1997) report that “... it was not until a 
demographer ‘discovered’ that wives are on average two years younger than their 
husbands that the edit rule which performed this exact imputation was removed from 
the Canadian Census system!” 

It can happen that reducing one source of error actually increases another, and little
is known about how this might work in practice. For example, heroic efforts by 
interviewers might result in some of the die-hard nonrespondents providing answers 
for the survey. But there is no guarantee that those responses are accurate—persons 
may make up data to stop the calls. Similarly, it is not clear how use of incentives to 
increase survey response may affect other aspects of accuracy. 

Much research is still needed on how to estimate and improve survey 
quality. Biemer and Lyberg (2003) recommend a holistic approach to survey 
design, considering all possible sources of errors at the design stage. Most of this 
book has focused on methods for reducing and estimating sampling error using a 
design-based approach. The study of error sources involves proposing and fitting 
stochastic models for the sources of error; commonly, mixed models are used that
incorporate terms describing bias and different sources of variability. A multivariate 
approach is needed since most surveys have multiple responses and errors may be 
correlated among different responses. The models can quickly become very complex 
as more terms are added for interactions of error sources and relationships between 
bias and variance so that the analyst must be careful not to overfit the data. Brick 
et al. (1994) found that some of the proposed models for studying sources of error 
in the U.S. Survey of Recent College Graduates were too complicated to fit without 
oversimplifying the assumptions and instead adopted a less structured approach. 

Marker and Morganstein (2004) describe the use of statistical quality improvement 
methods in survey organizations. Quality improvement programs require that every 
person in the organization be committed to, and rewarded for, survey quality. As was 
stated in Chapter 8, the best time to reduce sampling and nonsampling errors in a 
survey is at the design stage. 

The quality of the survey should be communicated to data users. Kasprzyk and
Giesbrecht (2001) recommend that even abbreviated reports on survey results should
contain the following:

■ Information about the data set, including whether it was based on a probability
sample

■ Sources of sampling and nonsampling error

■ Total in-scope sample size

■ Unit nonresponse rates

■ A reference to more detailed information about data collection

■ A contact for more information


Some organizations, claiming that sampling error is a small part of total survey 
error, have returned to taking convenience samples, using opt-in Internet panels of 
respondents. Langer (2009) points out that such polls have “multiple methodological 
challenges”: Some of them create potential biases by offering financial rewards to 
persons who volunteer to take surveys, and all of them live outside the framework 
of design-based inference. Probability samples are not perfect, and are subject to 
nonresponse and other nonsampling errors. But they remove many possible sources 
of bias, including the possibility that advocacy groups will bias a poll by encouraging 
their members to participate. The arguments put forward by some proponents of 
inexpensive convenience samples that their results agree with those from organizations 
that take probability samples do not prove the quality of their surveys. After all, the 
Literary Digest Survey discussed in Example 1.1 was accurate for several years—until 
it wasn’t. As Groves (2006, p. 670) says: 


Probability sampling offers measurable sampling errors and unbiased estimates when 
100 percent response rates are obtained. There is no such guarantee with low response 
rate surveys. Thus, within the probability sampling paradigm, high response rates are 
valued. Unfortunately, the alternative research designs for descriptive statistics, most 
notably volunteer panels, quota samples from large compilations of personal data 
records, and so forth, require even more heroic assumptions to derive the unbiased 
survey estimates. 


Statistical sampling is a relatively young field, with many dating its origin as a 
modern discipline to Kiaer (1897). The discipline has been spurred by societal needs 
as well as technological developments. Over the past 100 years, survey researchers 
have developed methods for probability sampling, nonresponse and undercoverage 
adjustment, measurement error models, designed experiments for improving sample 
design and reducing nonsampling errors, computer-intensive inference, small area 
estimation, sampling rare populations, and many other applications. You can now 
solve some of the challenges of the next 100 years. As Gertrude Cox (1957) said 
in her 1956 presidential address to the American Statistical Association, “We are 
surrounded with ever widening horizons of thought, which demand that we find better 
ways of analytical thinking. We must recognize that the observer is part of what he 
observes and that the thinker is part of what he thinks. We cannot passively observe 
the statistical universe as outsiders, for we are all in it.” 

Chapter Summary


Total survey error is the sum of five components: coverage error, nonresponse
error, measurement error, processing error, and sampling error. The main concern
with undercoverage and nonresponse is bias. Sampling error produces variability 
in the estimates. Measurement and processing error have both bias and variance 
aspects. 

Reports of survey results should include an assessment of errors. Quality improve- 
ment methods can be used to control errors throughout the survey-taking process. 
Designed experiments are useful for improving survey quality. 


Key Terms 


Total survey design: A philosophy of survey design for minimizing nonsampling 
as well as sampling errors. 


Total survey error: Sum of all sampling and nonsampling errors in the survey. 


For Further Reading 


The book Introduction to Survey Quality (Biemer and Lyberg, 2003) provides a com- 
prehensive guide to sources of errors in surveys, and what to do about them. Lessler 
and Kalsbeek (1992) discuss nonsampling errors and emphasize designing surveys to 
minimize all types of errors. The books edited by Lyberg et al. (1997), deLeeuw et al. 
(2008), and Pfeffermann and Rao (2009a) each contain several chapters on improv- 
ing quality in survey data collection and on choice of survey mode and interviewer 
training. The Federal Committee on Statistical Methodology (2001) summarizes best 
practices and methods used by U.S. government statistical agencies to measure and 
report sources of errors in surveys. Groves and Couper (1998) provide a thorough 
treatment of nonresponse errors in household surveys. Groves et al. (2009) present 
methods for improving survey quality at the design stage. 

Designed experiments and methods used for quality improvement in 
industry are also useful for improving survey quality. The books on experimen- 
tal design and quality control by Oehlert (2000), Juran and Godfrey (2000), Ryan 
(2000), and Montgomery (2008), while not specific for survey operations, give prin- 
ciples that should be much more widely used when designing surveys. Deming’s 
(1986) book Out of the Crisis is an example-filled guide to quality improvement. 
Some useful references on quality improvement in surveys are Biemer and Caspar 
(1994), Colledge and March (1993), Gonzalez (1994), Marker and Morganstein 
(2004), and the AAPOR (2008a) guidelines for best practices in surveys. The quality 
guidelines from Statistics Canada (2003) describe procedures for assuring quality 
in all steps of survey development and implementation. Scheuren and Alvey (2008) 
review the history and methods used in exit polling, with an emphasis on survey 
quality. 

A. Introductory Exercises


The National Do Not Call Registry allows U.S. residents to prohibit telemarketers 
from calling the registered numbers. Conkey (2005) reports on a telephone survey 
conducted by the Customer Care Alliance. Respondents who had signed up for the 
registry were asked about their satisfaction with the list; 51% said they are still getting 
calls they thought the registry was supposed to block. Discuss possible sources of 
measurement error in this survey. 


A university wishes to estimate the proportion of its students who have used cocaine. 
Students were classified into one of three groups: undergraduate, graduate, or profes- 
sional school (medical or law school), and were sampled randomly within the groups. 
Since there was some concern that students might be unwilling to disclose their use 
of cocaine to a university official, the following method was used. Thirty red balls, 
sixteen blue balls, and four white balls were placed in a box and mixed well. The 
student was then asked to draw one ball from the box. If the ball drawn was red, the 
person answered question (a). Otherwise question (b) was answered. 


Question (a): Have you ever used cocaine? 
Question (b): Is the ball you drew white? 


The results are as follows: 


Group                                Undergraduates   Graduates   Professional
Total number of students in group    8972             1548        860
Number of students sampled           900              150         80
Number answering yes                 123              27          27


Assuming that all responses were truthful, estimate the proportion of students who 
have used cocaine and report the standard error of your estimate. Compare this stan- 
dard error with the standard error you would expect to have if you asked the sample 
students question (a) directly and if all answered truthfully. 

Now suppose that all respondents answer truthfully with the randomized response 
method but 25% of those who have used cocaine deny the fact when asked directly. 
Which method gives an estimate of the overall proportion of students who have used 
cocaine with the smallest mean squared error? 


C. Working with Theory 


Kuk (1990) proposed the following randomized response method. Ask the respondent
to generate two independent binary variables \(X_1\) and \(X_2\) with \(P(X_1 = 1) = \theta_1\) and
\(P(X_2 = 1) = \theta_2\). The probabilities \(\theta_1\) and \(\theta_2\) are known. Now ask the respondent
to tell you the value of \(X_1\) if she is in the sensitive class, and \(X_2\) if she is not in the
sensitive class. Suppose the true proportion of persons in the sensitive class is \(p_S\).


a What is the probability that the respondent reports 1?


b Using your answer to (a), give an estimator \(\hat{p}_S\) of \(p_S\). What conditions must \(\theta_1\) and
\(\theta_2\) satisfy?


c What is \(V(\hat{p}_S)\) if an SRS is taken?


(Requires linear models.) In Section 15.3.1, we discussed the reliability of a survey
instrument. Suppose we measure each individual twice, under the same conditions.
Let \(X_i\) be the score of person \(i\) on the first survey administration, and let \(Y_i\) be the
score of person \(i\) on the second survey administration. Consider the following model:
Assume \(U_i \sim N(\mu, \sigma_S^2)\). Now let \(X_i = U_i + R_{i1}\) and \(Y_i = U_i + R_{i2}\), where \(R_{i1}\) and \(R_{i2}\)
are independent \(N(0, \sigma_R^2)\) random variables (and are also independent of \(U_i\)).


a Using matrices, find the distribution of \(\begin{bmatrix} X_i \\ Y_i \end{bmatrix}\).


b What is the reliability of the test under this model?

c Find \(E[Y \mid X = x]\) and \(\mathrm{Var}[Y \mid X = x]\).


D. Projects and Activities 


Read Deming’s (1944) article “On errors in surveys.” What sources of error identified 
by Deming are still considered part of total survey error? Did Deming discuss any 
errors that are no longer relevant? What new sources of error have arisen since Deming 
wrote his article? 


The goal of the National Comorbidity Survey Replication is to estimate the prevalence 
of mental disorders in the United States. Read the survey description by Kessler et al. 
(2004). What aspects of this survey might affect data quality? What design features 
were implemented to improve the quality of the survey? 


One problem that has occurred in surveys on sexual behavior in the United States is 
that, typically, men report more opposite-sex sexual partners than women do. This has 
led some researchers to be skeptical of the data quality, since one would expect the 
total number of opposite-sex partners for men to equal the total number of opposite- 
sex partners for women. Read the article by Tourangeau and Smith (1996) on asking 
sensitive questions. What steps did the authors take to reduce measurement error in 
their study? 


The websites fivethirtyeight.com and pollster.com provide commentary on the quality 
of polls in the United States. Read a recent entry and describe the aspects of survey 
quality discussed. 


In Exercise 29 of Chapter 1, you volunteered to be in an online panel for a survey. If 
you were asked to participate in a survey, report on your experiences. What are the 
sources of error in the survey? 


Read the section on survey design (pp. 181-188) from the Consumer Bankruptcy 
Study described in Exercise 29 of Chapter 3 (Warren and Tyagi, 2003). How did the 
investigators deal with the different components of total survey error? What might
they have done differently? 


Read the guidelines on statistical ethics by the American Statistical Association 
(1999) or the International Statistical Institute (2009). What specific recommenda- 
tions in the guidelines apply to survey research? 


Read the Code of Standards and Ethics for Survey Research by the Council of 
American Survey Research Organizations (2008). Give examples of how adhering 
to the standards in this code might improve the quality of survey data. 


Return to the survey you critiqued in Exercise 27 of Chapter 1. What sources of error 
were reported? How might the quality of the survey have been improved? 


Activity for course project. What methods were used to improve the quality of the 
survey you studied in Exercise 31 of Chapter 7? In your opinion, how effective were 
these methods? How could the survey have been improved? 


Appendix A: Probability Concepts
Used in Sampling


I recollect nothing that passed that day, except Johnson's quickness, who, when Dr. Beattie observed,
as something remarkable which had happened to him, that he had chanced to see both No. 1, and
No. 1000, of the hackney-coaches, the first and the last; “Why, Sir, (said Johnson,) there is an equal
chance for one's seeing those two numbers as any other two.” He was clearly right; yet the seeing of
the two extremes, each of which is in some degree more conspicuous than the rest, could not but strike
one in a stronger manner than the sight of any other two numbers.


James Boswell, The Life of Samuel Johnson


The essence of probability sampling is that we can calculate the probability with which
any subset of observations in the population will be selected as the sample. Most of
the randomization theory results used in this book depend on probability concepts
for their proof. In this appendix we present a brief review of some of the basic ideas
used. The reader should consult a more comprehensive reference on probability, such
as Ross (2006) or Durrett (1994), for more detail and for derivations and proofs.

Because all work in randomization theory concerns discrete random variables,
only results for discrete random variables are given in this section. We use the results
in Sections A.1-A.3 in Chapters 2-4, and the results in Sections A.3-A.4 in Chapters 5
and 6.


A.1 Probability

Consider performing an experiment in which you can write out all of the outcomes 
that could possibly happen, but you do not know exactly which one of those outcomes 
will occur. You might flip a coin, or draw a card from a deck, or pick three names out 
of a hat containing 20 names. Probabilities are assigned to the different outcomes and 
to sets composed of outcomes (called events), in accordance with the likelihood that 
the events will occur. Let \(\Omega\) be the sample space, the list of all possible outcomes. For
flipping a coin, \(\Omega\) = {heads, tails}. Probabilities in finite sample spaces have three
basic properties:


1 \(P(\Omega) = 1\).
2 For any event \(A\), \(0 \le P(A) \le 1\).
3 If the events \(A_1, \ldots, A_k\) are disjoint, then \(P\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i)\).


In sampling, we have a population of N units and use a probability sampling
scheme to select n of those units. We can think of those N units as balls in a box,
labeled 1 through N, and we draw n balls from the box. For illustration,
suppose N = 5 and n = 2. Then we draw two labeled balls out of the box:

[Figure: a box containing five balls labeled 1 through 5.]


If we take a simple random sample (SRS) of one ball, each ball has an equal probability 
1/N of being chosen as the sample. 


A.1.1 Simple Random Sampling with Replacement


In a simple random sample with replacement (SRSWR), we put a ball back after it is 
chosen, so the same population is used on successive draws from the population. For 
the box with N = 5, there are 25 possible samples (a, b) in \(\Omega\), where a represents the
first ball chosen and b represents the second ball chosen:


(1, 1) (2, 1) (3, 1) (4, 1) (5, 1)
(1, 2) (2, 2) (3, 2) (4, 2) (5, 2)
(1, 3) (2, 3) (3, 3) (4, 3) (5, 3)
(1, 4) (2, 4) (3, 4) (4, 4) (5, 4)
(1, 5) (2, 5) (3, 5) (4, 5) (5, 5)


Since we are taking a random sample, each of the possible samples has the same 
probability, 1/25, of being the one chosen. When we take a sample, though, we usually 
do not care whether we chose unit 4 first and unit 5 second, or the other way around. 
Instead, we are interested in the probability that our sample consists of units 4 and 5 
in either order, which we write as S = {4,5}. By the third property in the definition 
of a probability, 


\[
P(\{4, 5\}) = P[(4, 5) \cup (5, 4)] = P[(4, 5)] + P[(5, 4)] = \frac{2}{25}.
\]




Suppose we want to find P(unit 2 is in the sample). We can either count that nine 
of the outcomes above contain 2, so the probability is 9/25, or we can use the addition 
formula: 


\[
P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{A.1}
\]
Here, let A = {unit 2 is chosen on the first draw} and let B = {unit 2 is chosen on the
second draw}. Then,
\[
P(\text{unit 2 is in the sample}) = P(A) + P(B) - P(A \cap B) = 1/5 + 1/5 - 1/25 = 9/25.
\]
Note that, for this example,
\[
P(A \cap B) = P(A) \times P(B).
\]


That occurs in this situation because events A and B are independent, that is, whatever 
happens on the first draw has no effect on the probabilities of what will happen on the 
second draw. Independence of the draws occurs in finite population sampling when 
we sample with replacement. 
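
These with-replacement probabilities can be confirmed by listing the sample space directly. The short Python sketch below is an illustration only (the variable names are arbitrary); it enumerates the 25 ordered samples for N = 5 and n = 2 and uses exact fractions.

```python
from fractions import Fraction
from itertools import product

N, n = 5, 2
units = range(1, N + 1)

# Under SRSWR every ordered sample (a, b) is equally likely.
samples = list(product(units, repeat=n))
prob = Fraction(1, len(samples))

# P(sample consists of units 4 and 5, in either order)
p_4_and_5 = sum(prob for s in samples if set(s) == {4, 5})

# Events A = {unit 2 on first draw} and B = {unit 2 on second draw}
p_a = sum(prob for s in samples if s[0] == 2)
p_b = sum(prob for s in samples if s[1] == 2)
p_a_and_b = sum(prob for s in samples if s == (2, 2))

# Direct count of samples containing unit 2
p_unit_2 = sum(prob for s in samples if 2 in s)

print(p_4_and_5, p_unit_2, p_a + p_b - p_a_and_b)
```

The last two printed values agree, which is the addition formula (A.1) at work, and `p_a_and_b == p_a * p_b` reflects the independence of the two draws.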


A.1.2 Simple Random Sampling without Replacement


Most of the time, we sample without replacement because it is more efficient—if 
Heather is already in the sample, why should we use resources by sampling her again? 
If we plan to take an SRS (recall that SRS refers to a simple random sample without 
replacement) of our population with N balls, the ten possible samples (ignoring the 
ordering) are 


{1, 2} {1, 3} {1, 4} {1,5} {2, 3} 
{2, 4} {2, 5} {3, 4} {3, 5} {4, 5} 


Since there are ten possible samples and we are sampling with equal probabilities, 
the probability that a given sample will be chosen is 1/10. 


In general, there are
\[
\binom{N}{n} = \frac{N!}{n!\,(N - n)!} \tag{A.2}
\]
possible samples of size n that can be drawn without replacement and with equal
probabilities from a population of size N, where
\[
k! = k(k - 1)(k - 2) \cdots 1 \quad \text{and} \quad 0! = 1.
\]
For our example, there are
\[
\binom{5}{2} = \frac{5!}{2!\,(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = 10
\]
possible samples of size 2, as we found when we listed them.


Note that in sampling without replacement, successive draws are not independent.
For this example,
\[
P(\text{2 chosen on first draw, 4 chosen on second draw}) = \frac{1}{20}.
\]


But P(2 chosen on first draw) = 1/5, and P(4 chosen on second draw) = 1/5, so
P(2 chosen on first draw, 4 chosen on second draw) ≠ P(2 chosen on first draw) ×
P(4 chosen on second draw).


EXAMPLE A.1 Players of the Arizona State Lottery game “Fantasy 5” choose 5 numbers without
replacement from the numbers 1 through 35. If the 5 numbers you choose match the 5
official winning numbers, you win $50,000. What is the probability you win $50,000?


You could select a total of
\[
\binom{35}{5} = \frac{35!}{5!\,30!} = 324{,}632
\]
possible sets of 5 numbers. But only
\[
\binom{5}{5} = 1
\]
of those sets will match the official winning numbers, so your probability of winning
$50,000 is 1/324,632.

Cash prizes are also given if you match three or four of the numbers. To match
four, you must select four numbers out of the set of five winning numbers, and the
remaining number out of the set of 30 non-winning numbers, so the probability is
\[
P(\text{match exactly 4 balls}) = \frac{\binom{5}{4}\binom{30}{1}}{\binom{35}{5}} = \frac{150}{324{,}632}.
\]
■


EXERCISE A.1 What is the probability you match exactly 3 of the numbers? That you match at least
one of the numbers? ■


EXERCISE A.2 Calculating the sampling distribution in Example 2.4.
A box has eight balls; three of the balls contain the number 7. You select an SRS
(without replacement) of size 4. What is the probability that your sample contains no
7s? Exactly one 7? Exactly two 7s? ■


A.2 Random Variables and Expected Value


A random variable is a function that assigns a number to each outcome in the sample
space. Which number the random variable will actually assume is only determined
after we conduct the experiment and depends on a random process: Before we conduct
the experiment, we only know probabilities with which the different outcomes can
occur. The set of possible values of a random variable, along with the probability
with which each value occurs, is called the probability distribution of the random
variable. Random variables are denoted by capital letters in this book to distinguish
them from the fixed values \(y_i\). If \(X\) is a random variable, then \(P(X = x)\) is the
probability that the random variable \(X\) takes on the value \(x\). The quantity \(x\) is sometimes
called a realization of the random variable \(X\); \(x\) is one of the values that could occur
if we performed the experiment.


EXAMPLE A.2 In the game “Fantasy 5,” let X be the amount of money you will win from your
selection of numbers. You win $50,000 if you match all 5 winning numbers, $500
if you match 4, $5 if you match 3, and nothing if you match fewer than 3. Then the
probability distribution of X is given in the following table:


x           0                  5                500              50,000
P(X = x)    320,131/324,632    4350/324,632     150/324,632      1/324,632


If you played “Fantasy 5” many, many times, what would you expect your average
winnings per game to be? The answer is the expected value of X, defined by
\[
E(X) = EX = \sum_{x} x\,P(X = x). \tag{A.3}
\]


For “Fantasy 5,”
\[
E(X) = \left(0 \times \frac{320{,}131}{324{,}632}\right) + \left(5 \times \frac{4350}{324{,}632}\right) + \left(500 \times \frac{150}{324{,}632}\right) + \left(50{,}000 \times \frac{1}{324{,}632}\right) = \frac{146{,}750}{324{,}632} = 0.45.
\]


Think of a box containing 324,632 balls, in which 1 ball contains the number 50,000,
150 balls contain the number 500, 4350 balls contain the number 5, and the remaining
320,131 balls contain the number 0. The expected value is simply the average of the
numbers written inside all the balls in the box. One way to think about expected
value is to imagine repeating the experiment over and over again and calculating the
long-run average of the results. If you play “Fantasy 5” many, many times, you would
expect to win about 45 cents per game, even though 45 cents is not one of the possible
realizations of X.

Variance, covariance, correlation, and the coefficient of variation are defined directly
in terms of the expected value:
\[
V(X) = E[(X - EX)^2] = \mathrm{Cov}(X, X) \tag{A.4}
\]
\[
\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)] \tag{A.5}
\]
\[
\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V(X)V(Y)}} \tag{A.6}
\]
\[
\mathrm{CV}(X) = \frac{\sqrt{V(X)}}{E(X)} \quad \text{for } E(X) \ne 0. \tag{A.7}
\]
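
The “Fantasy 5” distribution and the quantities defined in (A.3), (A.4), and (A.7) can be checked numerically. The Python sketch below is an illustration only; the variable names are arbitrary, and the prize amounts are the ones stated in Example A.2.

```python
from fractions import Fraction
from math import comb, sqrt

total = comb(35, 5)                  # number of possible selections of 5 numbers
prizes = {5: 50_000, 4: 500, 3: 5}   # dollars won for 5, 4, or 3 matches

# P(match exactly k) = C(5, k) * C(30, 5 - k) / C(35, 5);
# outcomes with fewer than 3 matches all pay 0 and are pooled together.
pmf = {}
for k in range(6):
    x = prizes.get(k, 0)
    pmf[x] = pmf.get(x, 0) + Fraction(comb(5, k) * comb(30, 5 - k), total)

mean = sum(x * p for x, p in pmf.items())                    # (A.3)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # (A.4)
cv = sqrt(variance) / float(mean)                            # (A.7)

print(total, pmf[5], float(mean))
```

The large coefficient of variation returned in `cv` reflects how skewed the winnings are: almost all of the probability sits at 0.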


Expected value and variance have a number of properties that follow directly from
the definitions above.


Properties of Expected Value

1 If \(g\) is a function, then \(E[g(X)] = \sum_{x} g(x)\,P(X = x)\).
2 If \(a\) and \(b\) are constants, then \(E(aX + b) = aE(X) + b\).
3 If \(X\) and \(Y\) are independent, then \(E(XY) = (EX)(EY)\).
4 \(\mathrm{Cov}(X, Y) = E(XY) - (EX)(EY)\).
5 \(\mathrm{Cov}\left[\sum_{i=1}^{n} (a_i X_i + b_i), \sum_{j=1}^{m} (c_j Y_j + d_j)\right] = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i c_j\, \mathrm{Cov}(X_i, Y_j)\).
6 \(V(X) = E(X^2) - (EX)^2\).
7 \(V(X + Y) = V(X) + V(Y) + 2\,\mathrm{Cov}(X, Y)\).
8 \(-1 \le \mathrm{Corr}(X, Y) \le 1\).


EXERCISE A.3 Prove properties 1 through 8 using the definitions in (A.3) through (A.7). ■


In sampling, we often use estimators that are ratios of two random variables. But 
E[Y/X] usually does not equal EY/EX. To illustrate this, consider the following 
probability distribution for X and Y: 


x    y    y/x    P(X = x, Y = y)
1    2    2      1/4
2    8    4      1/4
3    6    2      1/4
4    8    2      1/4


Then EY/EX = 6/2.5 = 2.4, but E[Y/X] = 2.5. In this example, the values are close 
but not equal. 
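
The two quantities are easy to compute side by side. The following Python sketch (illustrative only; the dictionary layout is an arbitrary choice) uses the four equally likely (x, y) pairs from the table.

```python
from fractions import Fraction

# Joint distribution from the table: each (x, y) pair has probability 1/4.
joint = {
    (1, 2): Fraction(1, 4),
    (2, 8): Fraction(1, 4),
    (3, 6): Fraction(1, 4),
    (4, 8): Fraction(1, 4),
}

ex = sum(x * p for (x, y), p in joint.items())
ey = sum(y * p for (x, y), p in joint.items())

ratio_of_means = ey / ex                                          # EY / EX
mean_of_ratios = sum(Fraction(y, x) * p for (x, y), p in joint.items())  # E[Y/X]

print(float(ratio_of_means), float(mean_of_ratios))
```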
The random variable we use most frequently in this book is
\[
Z_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample} \\ 0 & \text{if unit } i \text{ is not in the sample.} \end{cases} \tag{A.8}
\]


This indicator variable tells us whether the ith unit is in the sample or not. In an SRS,
n of the random variables \(Z_1, Z_2, \ldots, Z_N\) will take on the value 1, and the remaining
N − n will be 0. For \(Z_i\) to equal 1, one of the units in the sample must be unit i, and
the other n − 1 units must come from the remaining N − 1 units in the population, so
\[
P(Z_i = 1) = P(\text{ith unit is in the sample}) = \frac{\binom{1}{1}\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}. \tag{A.9}
\]
Thus,
\[
E[Z_i] = 0 \times P(Z_i = 0) + 1 \times P(Z_i = 1) = P(Z_i = 1) = \frac{n}{N}.
\]
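Similarly, for \(i \ne j\),
\[
\begin{aligned}
P(Z_i Z_j = 1) &= P(Z_i = 1 \text{ and } Z_j = 1) \\
&= P(\text{ith unit is in the sample and jth unit is in the sample}) \\
&= \frac{\binom{2}{2}\binom{N-2}{n-2}}{\binom{N}{n}} = \frac{n(n - 1)}{N(N - 1)}.
\end{aligned}
\]
Thus for \(i \ne j\),
\[
E[Z_i Z_j] = 0 \times P(Z_i Z_j = 0) + 1 \times P(Z_i Z_j = 1) = P(Z_i Z_j = 1) = \frac{n(n - 1)}{N(N - 1)}.
\]

EXERCISE A.4 Show that
\[
V(Z_i) = \mathrm{Cov}(Z_i, Z_i) = \frac{n(N - n)}{N^2}
\]
and that, for \(i \ne j\),
\[
\mathrm{Cov}(Z_i, Z_j) = -\frac{n(N - n)}{N^2(N - 1)}.
\]
■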
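
For a small population, the moments of the indicator variables can be verified by brute force. The Python sketch below is an illustration, not a proof: it lists all ten SRSs for N = 5 and n = 2 (the box of five balls used earlier) and computes the expectations exactly. The helper names `expect` and `z` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

N, n = 5, 2
samples = list(combinations(range(1, N + 1), n))  # all unordered SRSs
prob = Fraction(1, len(samples))                  # each sample equally likely


def expect(f):
    """Expected value of f(sample) over all equally likely samples."""
    return sum(f(s) * prob for s in samples)


def z(i):
    """Return the indicator Z_i as a function of the sample."""
    return lambda s: int(i in s)


e_z1 = expect(z(1))
e_z1z2 = expect(lambda s: z(1)(s) * z(2)(s))
var_z1 = expect(lambda s: z(1)(s) ** 2) - e_z1 ** 2
cov_z1z2 = e_z1z2 - e_z1 * expect(z(2))

print(e_z1, e_z1z2, var_z1, cov_z1z2)
```

The negative covariance makes sense: if unit 1 is in the sample, there is one fewer slot left for unit 2.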


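The properties of expectation and covariance may be used to prove many results
in finite population sampling. In Chapter 4, we use the covariance of \(\bar{x}\) and \(\bar{y}\) from an
SRS. Let
\[
\bar{x}_U = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{N} Z_i x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{N} Z_i y_i,
\]
\[
S_x^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (x_i - \bar{x}_U)^2, \quad S_y^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2,
\]
and
\[
R = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N - 1) S_x S_y}.
\]
Then,
\[
\mathrm{Cov}(\bar{x}, \bar{y}) = \left(1 - \frac{n}{N}\right) \frac{R S_x S_y}{n}. \tag{A.10}
\]

We use properties 5 and 6 of expected value, along with the results of
Exercise A.4, to show (A.10):
\[
\begin{aligned}
\mathrm{Cov}(\bar{x}, \bar{y}) &= \frac{1}{n^2}\,\mathrm{Cov}\left(\sum_{i=1}^{N} Z_i x_i, \sum_{j=1}^{N} Z_j y_j\right) \\
&= \frac{1}{n^2}\sum_{i=1}^{N}\sum_{j=1}^{N} x_i y_j\, \mathrm{Cov}(Z_i, Z_j) \\
&= \frac{1}{n^2}\sum_{i=1}^{N} x_i y_i V(Z_i) + \frac{1}{n^2}\sum_{i=1}^{N}\sum_{j \ne i} x_i y_j\, \mathrm{Cov}(Z_i, Z_j) \\
&= \frac{1}{n}\,\frac{N - n}{N^2}\sum_{i=1}^{N} x_i y_i - \frac{1}{n}\,\frac{N - n}{N^2(N - 1)}\sum_{i=1}^{N}\sum_{j \ne i} x_i y_j \\
&= \frac{1}{n}\,\frac{N - n}{N^2}\left(1 + \frac{1}{N - 1}\right)\sum_{i=1}^{N} x_i y_i - \frac{1}{n}\,\frac{N - n}{N^2(N - 1)}\sum_{i=1}^{N}\sum_{j=1}^{N} x_i y_j \\
&= \frac{1}{n}\,\frac{N - n}{N(N - 1)}\left(\sum_{i=1}^{N} x_i y_i - N \bar{x}_U \bar{y}_U\right) \\
&= \frac{1}{n}\,\frac{N - n}{N(N - 1)}\sum_{i=1}^{N} (x_i - \bar{x}_U)(y_i - \bar{y}_U) \\
&= \frac{1}{n}\left(1 - \frac{n}{N}\right) R S_x S_y.
\end{aligned}
\]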
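
Result (A.10) can also be checked numerically for a small population by listing every possible SRS. The Python sketch below is an illustration under stated assumptions: the population values `x` and `y` are made up for the check, and the helper `mean` is hypothetical. Because \(R S_x S_y\) equals the population covariance with divisor N − 1, no square roots are needed and exact fractions can be used.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 5 units with values (x_i, y_i).
x = [1, 2, 4, 4, 9]
y = [3, 1, 6, 5, 10]
N, n = len(x), 2

samples = list(combinations(range(N), n))  # all possible SRSs of size n
prob = Fraction(1, len(samples))


def mean(values, idx):
    """Exact mean of values[i] over the indices in idx."""
    idx = list(idx)
    return Fraction(sum(values[i] for i in idx), len(idx))


xbar_U, ybar_U = mean(x, range(N)), mean(y, range(N))

# Left side of (A.10): Cov(xbar, ybar) computed over the sampling distribution.
e_xbar = sum(mean(x, s) * prob for s in samples)
e_ybar = sum(mean(y, s) * prob for s in samples)
cov_direct = sum(mean(x, s) * mean(y, s) * prob for s in samples) - e_xbar * e_ybar

# Right side of (A.10), using R * S_x * S_y = sum((x_i - xbar_U)(y_i - ybar_U)) / (N - 1).
r_sx_sy = sum((x[i] - xbar_U) * (y[i] - ybar_U) for i in range(N)) / (N - 1)
cov_formula = (1 - Fraction(n, N)) * r_sx_sy / n

print(cov_direct, cov_formula)
```

The same check passes for any other choice of `x`, `y`, and `n` with 1 ≤ n ≤ N, since (A.10) holds for every finite population.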


EXERCISE A.5 Show that
\[
\mathrm{Corr}(\bar{x}, \bar{y}) = R. \tag{A.11}
\]
■


A.3 Conditional Probability


In sampling without replacement, successive draws from the population are dependent:
The unit we choose on the first draw changes the probabilities of selecting the
other units on subsequent draws. When taking an SRS from our box of five balls in
Section A.1, each ball has probability 1/5 of being chosen on the first draw. If we
choose ball 2 on the first draw and sample without replacement, then
\[
P(\text{select ball 3 on second draw} \mid \text{select ball 2 on first draw}) = \frac{1}{4}.
\]
(Read as “the conditional probability that ball 3 is selected on the second draw given
that ball 2 is selected on the first draw equals 1/4.”) Conditional probability allows us
to adjust the probability of an event if we know that a related event occurred.

The conditional probability of A given B is defined to be
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{A.12}
\]
In sampling we usually use this definition the other way around:
\[
P(A \cap B) = P(A \mid B)\,P(B). \tag{A.13}
\]
If events A and B are independent, that is, if knowing whether A occurred gives us
absolutely no information about whether B occurred, then \(P(A \mid B) = P(A)\) and
\(P(B \mid A) = P(B)\).

Suppose we have a population with 8 households (HHs) and 15 persons living in 
the households, as follows: 


Household Persons 
1 1,2,3 
2 4 
3 5 
4 6, 7 
5 8 
6 9,10 
7 11, 12, 13, 14 
8 15 


In a one-stage cluster sample, as discussed in Chapter 5, we might take an SRS 
of two households, then interview each person in the selected households. Then, 


\[
P(\text{select person 10}) = P(\text{select HH 6})\,P(\text{select person 10} \mid \text{select HH 6}) = \left(\frac{2}{8}\right)\left(\frac{2}{2}\right) = \frac{2}{8}.
\]


In fact, for this example the probability that any individual in the population is inter- 
viewed is the same value, 2/8, because each household is equally likely to be chosen 
and the probability a person is selected is the same as the probability that the household 
is selected. 

Suppose now that we take a two-stage cluster sample instead of a one-stage cluster
sample, and we interview only one randomly selected person in each selected
household. Then, in this example, we are more likely to interview persons living alone than
those living with others:


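$$
\begin{aligned}
P(\text{select person 4}) &= P(\text{select HH 2})\,P(\text{select person 4} \mid \text{select HH 2}) \\
&= \left(\frac{2}{8}\right)\left(\frac{1}{1}\right) = \frac{8}{32},
\end{aligned}
$$

$$
\begin{aligned}
P(\text{select person 12}) &= P(\text{select HH 7})\,P(\text{select person 12} \mid \text{select HH 7}) \\
&= \left(\frac{2}{8}\right)\left(\frac{1}{4}\right) = \frac{2}{32}.
\end{aligned}
$$
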
These calculations extend to multistage cluster sampling because of the general
result

$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1 \mid A_2, \ldots, A_k)\,P(A_2 \mid A_3, \ldots, A_k) \cdots P(A_k). \tag{A.14}$$


Suppose we take a three-stage cluster sample of grade school students. First, we take 
an SRS of schools, then an SRS of classes within schools, then an SRS of students 
within classes. Then the event {Joe is selected in the sample} is the same as {Joe’s
school is selected ∩ Joe’s class is selected ∩ Joe is selected} and we can find Joe’s
probability of inclusion by
P(Joe in sample) = P(Joe’s school is selected)
× P(Joe’s class is selected | Joe’s school is selected)
× P(Joe is selected | Joe’s school and class are selected).


If we sample 10% of the schools, 20% of classes within selected schools, and 50% 
of students within selected classes, then 


P(Joe in sample) = (0.10)(0.20)(0.50) = 0.01. 
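
Both the two-stage household calculation and Joe's three-stage probability multiply one conditional selection probability per stage, as in (A.14). The sketch below illustrates that product; the function name and the `household_sizes` dictionary are illustrative, with sizes read from the household table earlier in this section.

```python
from fractions import Fraction
from math import prod

def inclusion_probability(stage_probs):
    """Multiply the conditional selection probabilities, one per stage (A.14)."""
    return prod(stage_probs, start=Fraction(1))

# Two-stage sample: SRS of 2 of the 8 households, then one person per household.
household_sizes = {1: 3, 2: 1, 3: 1, 4: 2, 5: 1, 6: 2, 7: 4, 8: 1}
p_household = Fraction(2, 8)
two_stage = {hh: inclusion_probability([p_household, Fraction(1, size)])
             for hh, size in household_sizes.items()}

# Three-stage sample: 10% of schools, 20% of classes, 50% of students.
joe = inclusion_probability(
    [Fraction(10, 100), Fraction(20, 100), Fraction(50, 100)])

print(two_stage[2], two_stage[7], joe)  # 1/4 1/16 1/100
```

A person in household 2 (living alone) has probability 1/4 = 8/32 of being interviewed, while a person in household 7 has probability 1/16 = 2/32, matching the hand calculation.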


A.4 Conditional Expectation

Conditional expectation is used extensively in the theory of cluster sampling. Let X 
and Y be random variables. Then, using the definition of conditional probability, 

$$P(Y = y \mid X = x) = \frac{P(Y = y \cap X = x)}{P(X = x)}. \tag{A.15}$$

This gives the conditional distribution of Y given that X = x. The conditional
expectation of Y given that X = x simply follows the definition of expectation using
the conditional distribution:

$$E(Y \mid X = x) = \sum_{y} y\,P(Y = y \mid X = x). \tag{A.16}$$


The conditional variance of Y given that X = x is defined similarly: 


$$V(Y \mid X = x) = \sum_{y} [y - E(Y \mid X = x)]^2\,P(Y = y \mid X = x). \tag{A.17}$$

EXAMPLE A.3 Consider a box with two balls, A and B:

[Figure: ball A contains the numbers 1, 3, 4, and 4; ball B contains the numbers 2 and 6.]


Choose one of the balls at random, then randomly select one of the numbers inside 
that ball. Let Y = the number that we choose and let 


$$
Z = \begin{cases}
1 & \text{if we choose ball A} \\
0 & \text{if we choose ball B.}
\end{cases}
$$


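Then,

$$P(Y = 1 \mid Z = 1) = \frac{1}{4}, \qquad P(Y = 3 \mid Z = 1) = \frac{1}{4}, \qquad P(Y = 4 \mid Z = 1) = \frac{1}{2},$$

and

$$E(Y \mid Z = 1) = \left(1 \times \frac{1}{4}\right) + \left(3 \times \frac{1}{4}\right) + \left(4 \times \frac{1}{2}\right) = 3.$$

Similarly,

$$P(Y = 2 \mid Z = 0) = \frac{1}{2} \qquad \text{and} \qquad P(Y = 6 \mid Z = 0) = \frac{1}{2},$$

so

$$E(Y \mid Z = 0) = \left(2 \times \frac{1}{2}\right) + \left(6 \times \frac{1}{2}\right) = 4.$$
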

In short, if we know that ball A is picked, then the conditional expectation of Y is 
the average of numbers in ball A since an SRS of size 1 is taken from the ball; the 
conditional expectation of Y given that ball B is picked is the average of the numbers 
in ball B.


Note that E(Y |X = x) is a function of x; call it g(x). Define the conditional 
expectation of Y given X, E(Y |X), to be g(X), the same function but of the random 
variable instead. E(Y | X) is a random variable and gives us the conditional expected 
value of Y for the general random variable X: for each possible value of x, the value 
E(Y |X = x) occurs with probability P(X = x). 
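
The distinction between the number E(Y | Z = z) and the random variable E(Y | Z) can be made concrete with Example A.3. The sketch below assumes ball contents (1, 3, 4, 4 for ball A and 2, 6 for ball B) chosen to reproduce the conditional probabilities stated in that example; the function `g` plays the role of g(x) in the paragraph above.

```python
from fractions import Fraction

# Assumed ball contents, chosen to match the conditional probabilities
# in Example A.3 (z = 1 is ball A, z = 0 is ball B).
balls = {1: [1, 3, 4, 4], 0: [2, 6]}
p_z = {1: Fraction(1, 2), 0: Fraction(1, 2)}

def cond_dist(z):
    """P(Y = y | Z = z) when one number is drawn at random from the ball."""
    numbers = balls[z]
    return {y: Fraction(numbers.count(y), len(numbers)) for y in set(numbers)}

def g(z):
    """E(Y | Z = z): the mean of the conditional distribution, as in (A.16)."""
    return sum(y * p for y, p in cond_dist(z).items())

# E(Y | Z) is the random variable g(Z): it takes the value g(z)
# with probability P(Z = z).
dist_of_cond_exp = {}
for z, p in p_z.items():
    dist_of_cond_exp[g(z)] = dist_of_cond_exp.get(g(z), 0) + p

print(g(1), g(0), dist_of_cond_exp)
```

So `g(1)` and `g(0)` are fixed numbers (3 and 4), while `dist_of_cond_exp` describes a random variable that equals 3 or 4, each with probability 1/2.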
EXAMPLE A.4 In Example A.3, we know the probability distribution of Z and can thus use the
conditional expectations calculated to write the probability distribution of E(Y | Z):


z    E(Y | Z = z)    Probability
0    4               1/2
1    3               1/2


In sampling, we need this general concept of conditional expectation largely so 
we can use the following properties of conditional expectation to find expected values 
and variances in cluster samples. 


Properties of Conditional Expectation 

1 E(X | X) = X.

2 E[f(X)Y | X] = f(X)E(Y | X).

3 If X and Y are independent, then E(Y | X) = E(Y).

4 E(Y) = E[E(Y | X)].

5 V[Y] = V[E(Y | X)] + E[V(Y | X)].


Conditional expectation can be confusing, so let’s talk about what these properties 
mean. The interested reader should see Ross (2006) or Durrett (1994) for proofs of 
these properties. 


1 E(X | X) = X. If we know what X is already, then we expect X to be X. The
probability distribution of E(X | X) is the same as the probability distribution of X.

2 E[f(X)Y | X] = f(X)E(Y | X). If we know what X is, then we know X², or log X,
or any function f(X) of X, so f(X) behaves like a constant and can be taken outside
the conditional expectation.


3 If X and Y are independent, then E(Y |X) = E(Y). If X and Y are independent, 
then knowing X gives us no information about Y. Thus the expected value of Y, 
the average of all the possible outcomes of Y in the experiment, is the same no 
matter what X is. 

4 E(Y) = E[E(Y | X)]. This property, called successive conditioning, and property 5
are the ones we use the most in sampling; we use them to find the bias and
variance of estimators in cluster sampling. Successive conditioning simply says
that if we take the weighted average of the conditional expected value of Y given
that X = x, with weights P(X = x), the result is the expected value of Y. You
use successive conditioning every time you take a weighted average of a quantity
over subpopulations: If a population has 60 women and 40 men, and if the average
height of the women is 64 inches and the average height of the men is 69 inches,
then the average height for the population is


(64 × 0.6) + (69 × 0.4) = 66 inches.
In this example, 64 is the conditional expected value of height given that the person
is a woman, 69 is the conditional expected value of height given that the person
is a man, and 66 is the expected value of height for all persons in the population.

5 V[Y] = V[E(Y | X)] + E[V(Y | X)]. This property gives an easy way of calculating
variances in two-stage cluster samples. It says that the total variability has two
parts: (a) the variability that arises because E(Y | X = x) varies with different
values of x, and (b) the variability that arises because there can be different values
of y associated with the same value of x. Note that, using property 6 of Expected
Value in Section A.2,


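$$V(Y \mid X) = E\{[Y - E(Y \mid X)]^2 \mid X\} = E[Y^2 \mid X] - [E(Y \mid X)]^2 \tag{A.18}$$

and

$$
\begin{aligned}
V[E(Y \mid X)] &= E\left(\{E(Y \mid X) - E[E(Y \mid X)]\}^2\right) \\
&= E\left(\{E(Y \mid X) - E(Y)\}^2\right) \\
&= E\{[E(Y \mid X)]^2\} - [E(Y)]^2.
\end{aligned}
\tag{A.19}
$$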
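
Taking the expectation of (A.18) and adding (A.19) shows why property 5 holds. The following is a sketch of that step rather than a full proof; it uses only successive conditioning (property 4), applied once to Y² and once to Y.

```latex
\begin{aligned}
E[V(Y \mid X)] + V[E(Y \mid X)]
  &= E\{E(Y^2 \mid X) - [E(Y \mid X)]^2\}
     + E\{[E(Y \mid X)]^2\} - [E(Y)]^2 \\
  &= E[E(Y^2 \mid X)] - [E(Y)]^2 \\
  &= E(Y^2) - [E(Y)]^2 \\
  &= V(Y).
\end{aligned}
```

The two terms involving [E(Y | X)]² cancel, which is why the within-group and between-group pieces add up exactly to the total variance.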


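EXAMPLE A.5 Here’s how conditional expectation properties work in Example A.3. Successive
conditioning implies that

$$
\begin{aligned}
E(Y) &= E(Y \mid Z = 0)\,P(Z = 0) + E(Y \mid Z = 1)\,P(Z = 1) \\
&= \left(4 \times \frac{1}{2}\right) + \left(3 \times \frac{1}{2}\right) = 3.5.
\end{aligned}
$$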


We can find the distribution of V(Y | Z) using (A.18): 


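$$
\begin{aligned}
V(Y \mid Z = 0) &= E(Y^2 \mid Z = 0) - [E(Y \mid Z = 0)]^2 \\
&= \left(2^2 \times \frac{1}{2}\right) + \left(6^2 \times \frac{1}{2}\right) - 4^2 = 4,
\end{aligned}
$$

$$
\begin{aligned}
V(Y \mid Z = 1) &= E(Y^2 \mid Z = 1) - [E(Y \mid Z = 1)]^2 \\
&= \left(1^2 \times \frac{1}{4}\right) + \left(3^2 \times \frac{1}{4}\right) + \left(4^2 \times \frac{1}{2}\right) - 3^2 = 1.5.
\end{aligned}
$$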


These calculations give the following probability distribution for V(Y | Z): 


z    V(Y | Z = z)    Probability
0    4               1/2
1    1.5             1/2

Thus, using (A.19),

$$
\begin{aligned}
V[E(Y \mid Z)] &= E\{[E(Y \mid Z) - E(Y)]^2\} \\
&= [E(Y \mid Z = 0) - E(Y)]^2\,P(Z = 0) + [E(Y \mid Z = 1) - E(Y)]^2\,P(Z = 1) \\
&= \left[(4 - 3.5)^2 \times \frac{1}{2}\right] + \left[(3 - 3.5)^2 \times \frac{1}{2}\right] \\
&= 0.25.
\end{aligned}
$$

Using the probability distribution of V(Y | Z),

$$E[V(Y \mid Z)] = \left(4 \times \frac{1}{2}\right) + \left(1.5 \times \frac{1}{2}\right) = 2.75.$$

Consequently,

$$V(Y) = V[E(Y \mid Z)] + E[V(Y \mid Z)] = 0.25 + 2.75 = 3.00.$$


If we did not have the properties of conditional expectation, we would need to 
find the unconditional probability distribution of Y to calculate its expectation and 
variance—a relatively easy task for the small number of options in Example A.3 but 
cumbersome to do for general multistage cluster sampling. 
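
For a problem this small, both routes can be carried out and compared. The sketch below computes E(Y) and V(Y) once through properties 4 and 5 and once from the unconditional distribution of Y. The ball contents (1, 3, 4, 4 and 2, 6) are an assumption chosen to match the conditional probabilities in Example A.3, and all function names are illustrative.

```python
from fractions import Fraction

balls = {1: [1, 3, 4, 4], 0: [2, 6]}          # assumed contents (A = 1, B = 0)
p_z = {1: Fraction(1, 2), 0: Fraction(1, 2)}  # each ball equally likely

def mean(dist):
    """Expected value of a discrete distribution {value: probability}."""
    return sum(v * p for v, p in dist.items())

def variance(dist):
    """Variance of a discrete distribution {value: probability}."""
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist.items())

def cond_dist(z):
    """P(Y = y | Z = z) for one number drawn at random from ball z."""
    nums = balls[z]
    return {y: Fraction(nums.count(y), len(nums)) for y in set(nums)}

# Route 1: condition on the ball, then apply properties 4 and 5.
cond_mean = {z: mean(cond_dist(z)) for z in balls}
cond_var = {z: variance(cond_dist(z)) for z in balls}
e_y = sum(cond_mean[z] * p_z[z] for z in balls)
var_of_cond_mean = sum((cond_mean[z] - e_y) ** 2 * p_z[z] for z in balls)
mean_of_cond_var = sum(cond_var[z] * p_z[z] for z in balls)
v_y = var_of_cond_mean + mean_of_cond_var

# Route 2: build the unconditional distribution of Y and use it directly.
uncond = {}
for z in balls:
    for y, p in cond_dist(z).items():
        uncond[y] = uncond.get(y, 0) + p * p_z[z]

print(e_y, var_of_cond_mean, mean_of_cond_var, v_y)  # 7/2 1/4 11/4 3
print(mean(uncond), variance(uncond))                # 7/2 3
```

The two routes agree here. In a multistage design the second route would require enumerating every path through the stages, whereas the first only needs the mean and variance within each stage.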


EXERCISE A.6 Consider the box below, with 3 balls labelled 1, 2, and 3:


Suppose we take an SRS of one ball, then subsample an SRS of one number from 
the selected ball. Let Z represent the number of the ball chosen, and let Y represent 
the number we choose from the ball. Use the properties of conditional expectation to 
find E(Y) and V(Y).
importance of, 9, 16, 85, 332, 336,
527-529 
model-based approach and, 205 
nonsampling error reduction, 18, 
332-336, 527-533, 538 
optimal, 87-89, 192-193, 205 
simple random sampling, 46-50 
stratified sampling, 85-95 
unequal-probability sampling, 
231-232 
Design-based inference, 51-54, 60, 
153, 202, 447-452 
regression coefficients, 434-443, 
447-452 
simple random sampling, 51-54 
unequal-probability sampling, 
254-262, 264 
Design effect, 309-312 
in chi-square tests, 407-411 
cluster sampling, 310 
confidence intervals and, 311 
regression coefficients, 435, 
444-445 
sample size estimation and, 311-312 
stratified sampling, 310 
Diagnostics, for regression analyses, 
449, 465-466 
Difference estimation, 141-142 
Disease prevalence, 66, 110, 323-325, 
477, 487, 491-492 
Distribution function, 288-291 
Dollar stratification, 89 
Dollar unit sampling, 252-254 
Domains, 133-138, 154, 518-519 
comparing means of, 160, 272, 
445-446, 464 
small, 518-522 
Double sampling. See Two-phase 
sampling 
Dual-system estimation, 500, 505 


Element, 3 
Empirical cumulative distribution 
function, 290 
Empirical probability mass function, 290 
Error 
nonsampling, 16-18, 330, 332-336, 
527-541 
sampling, 16-18, 527 
total survey, 528 
Estimation bias, 31 
Event, 549 
Expected value, 30, 552-556 
Experiment, designed 
contrasted with survey, 317-319 
use for improving quality, 329-330, 
333, 519, 534, 539, 544-545 
Fay-Herriot estimator, 521

Finite population correction (fpc), 36, 60 

First-order correction to chi-square test, 
413-415 

Fixed-effects ANOVA model, 95, 318 

Frame, sampling, 3-5, 18, 529-533 


Generalized regression (GREG), 154, 457-460 
in two-phase sampling, 479-480 
Generalized variance function (GVF), 
386-388, 393, 398-399 
Generalized weight share method, 272-273 
Goodness-of-fit tests, 353, 403, 405-406, 
410, 417. See also Categorical data 
analysis 
Graphs 
complex samples, 294-309 
design of surveys with, 49, 194 
bivariate, 304-309
regression, 435, 437-438 
simple random samples, 35, 294 
stratified samples, 76 
univariate, 76, 291-292, 294 


Hansen-Hurwitz estimator, 228 
Hartley-Politz-Simmons method, 363 
Hierarchical linear model, 454 
Histograms with survey data, 35, 294-296 
Homogeneity 

measure of, 174-175 

test of, 404-405 
Horvitz-Thompson estimator, 240-247,
254-262
Horvitz-Thompson theorem, 254-257
Hot-deck imputation, 348-349


Ignorable nonresponse, 339, 351 
Imputation, 346-351 
Inclusion probability, 39, 60, 82, 240, 266 
Incomplete data. See Capture recapture 
estimation; Nonresponse 
Independence 
chi-square test for, 402-404, 411-417 
cluster sampling and, 168, 200 
events, 551 
Indirect sampling, 272-273, 525 
Internet surveys, 7, 24, 335, 531-532, 544 
Interpenetrating subsampling, 198-199, 
538-539 
Interviewers 
effect on survey accuracy, 10, 335, 536 
falsification, 536 
Intraclass correlation coefficient (ICC), 
174-176, 207, 213-214 
Inverse sampling, 492 
Item nonresponse, 329, 335. See also 
Nonresponse 


Jackknife, 380-383 
in two-phase sampling, 481-482 
regression coefficients, 441 
Judgment sample, 5, 9 
Lahiri’s method, 227-228, 236, 249, 271
Leading question, 2, 14 
Linearization method for variance 
estimation, 125, 366-369, 392, 393, 
438-441 
Demnati-Rao, 397-398, 489-490 
regression coefficients, 438-441 
Linear regression. See Regression 
analysis; Regression estimation 
Literary Digest Survey, 8-9, 17, 331, 544 
Logistic regression, 455-457 
Loglinear models 
capture-recapture, 501-504 
complex surveys, 419-421 
multinomial sampling, 417-419 


Margin of error, 16, 46, 60 
Mark-recapture estimation. See 
Capture-recapture estimation 
Maternal and Infant Health Survey 
(MIHS), 429-430, 435, 450-452 
Mean 
population, 32 
sample, 35-36 
Mean-of-ratios estimator, 162 
Mean squared error (MSE) 
design-based, 31 
model-based, 56 
Measurement bias, 9, 31 
Measure of homogeneity, 174-175 
Median, 220, 289, 293-294, 
296-297, 379-380, 389-392 
Missing at random (MAR), 339 
Missing completely at random (MCAR), 
338-339 
Missing data. See Nonresponse
Mitofsky-Waksberg method, 249-251, 
277-278, 514 
Mixed models, 453-455, 521-522, 
537-538, 543 
Mode of survey, 335, 528, 531-533, 535, 
536, 540 
Model-assisted inference, 147, 448, 461 
Model-based inference 
chi-square tests, 416-417 
cluster sampling, 200-205, 262-264 
confidence intervals, 56-57 
design and, 205 
quota sampling, 97 
ratio estimation, 146-153 
regression analysis, 430-434, 
447-455 
regression estimation, 151-153 
simple random sampling, 54-57 
stratified sampling, 95-96 
unequal-probability sampling, 
262-264 
weights and, 288, 443-444, 
447-452 
Model-unbiased estimator, 55, 432 
Multilevel linear model, 454 
Multinomial distribution, 68, 271 


Multinomial sampling 

chi-square tests with, 401-406 
definition, 403 

loglinear models and, 417-419 
Multiple frame surveys, 514-516, 522, 
523 

Multiple imputation, 350-351 
Multiple regression, 441-443. See also 
Regression analysis 

Multiplicity, 6, 517 

Multiplicity sampling. See Network 
sampling 


National Assessment of Educational 
Progress, 453, 519 
National Crime Victimization Survey, 
4, 11-12, 281, 312-317, 329, 331, 
341-343, 386-387, 430 
design of, 312-314 
domains in, 512 
nonresponse in, 314, 329, 331, 
341-343 
questionnaire design, 11-13 
regression, 430 
variance estimation in, 386-388 
weights in, 314-317 
National Health and Nutrition 
Examination Survey, 288, 
304-309, 442-443 
National Household Survey on Drug 
Abuse, 540 
National Immunization Survey, 476-477 
National Pesticide Survey, 4, 92-95 
National Survey of Veterans, 515-516 
Network sampling, 516-517, 523, 525 
Neyman allocation, 89-91, 111 
Nonresponse, 6, 18, 329-364, 533-536 
bias, 331-332, 356, 361-362 
effects of ignoring, 330-332 
factors affecting, 332-336, 534-535 
guidelines for reporting, 355 
ignorable, 339, 351 
imputation for, 346-351 
item, 329 
mechanisms, 338-340 
missing at random, 339 
missing completely at random, 
338-339 
models for, 339, 351-354 
not missing at random, 339-340 
rate, 330-332, 354 
survey design and, 332-336 
unit, 329 
weight adjustments for, 314, 
340-346, 362 
Nonsampling error, 16-18, 330, 
332-336, 527-541 
Normal equations, 431, 436, 442 
Notation 
cluster sampling, 168-170 
complex surveys, 286 
ratio estimation, 118 


simple random sampling, 33-35 
stratified sampling, 77-79 


Odds ratio, 402-403, 407, 408-409 

One-stage cluster sampling. See Cluster 
sampling 

Optimal allocation, 87-90, 100, 
111-112 

Ordinary least squares, 138, 431 

Overcoverage, 6, 532 


Percentile. See Quantiles 
Pilot sample, 48 
Plots. See Graphs 
Poisson sampling, 252, 266 
Polls, public opinion, 4, 7, 14, 48, 530 
Population 
estimating the size of, 495-505 
finite, 28, 32 
sampled, 3-4 
target, 3-5 
Poststratification, 121, 142-143, 154, 
342-345, 460 
as generalized regression, 460 
for nonresponse, 143, 342-345 
Precise estimator, 32 
Primary sampling unit (psu), 165, 207 
Probability distribution, 552 
Probability mass function, 289, 319 
Probability proportional to size (pps) 
sampling, 231, 266. See also Unequal 
probability sampling 
Probability sampling, 2, 25-33, 60 
Probability theory, 25, 549-552 
Product-multinomial sampling, 404 
Propensity score, 338 
Proportional allocation, 85-87, 100 
Public Use Microdata Samples, 279 
Purposive sample, 5, 9, 35, 263, 486 


Quality improvement, 335, 539, 543, 545 

Quantiles, 289, 293-294, 296-297, 
389-392 

Quantile-quantile plot, 323 

Question order, 11, 15-16 

Questionnaire design, 11-16, 19, 335, 528, 
536 

Quota sampling, 96-99 


Raking, 344-345, 356 

Random-coefficient regression model, 454 

Random digit dialing, 6, 249-251 

Random effects, 454 

Random-effects ANOVA model, 200-201, 
318 

Random group methods, 370-373 

Randomization inference, 51-54. See also 
Design-based inference 

Randomized response, 541-542 

Random numbers, use in selecting sample, 
29, 34-35, 62, 65-66, 68, 225-226 


Random variable, 552-556 
Ranked set sampling, 492-494 
Rao-Hartley-Cochran estimator, 275 
Rao-Scott test, 413-415, 422, 426-427 
Rare population, 511-518 
Ratio estimation, 117-138, 146-153, 
155, 284, 459-460 
bias, 122-129 
capture-recapture and, 496-497 
combined, 144-146, 284 
complex surveys, 284 
design-based inference, 148, 459-460 
estimating means, 118, 122-126 
estimating proportions, 129-131 
estimating ratios, 118, 122-126 
estimating totals, 118, 124-126, 284 
mean squared error, 122-129 
model-based inference, 146-153 
reasons for use, 119-122, 133 
separate, 144-145, 284 
two-phase sampling and, 469-471, 
477-479, 485 
variance, 124-129, 368-369 


variance estimation, 125-126, 368-369 


Realization, of random variable, 54, 
552-553 
Regression analysis, 429-468 
causal relationships and, 447 
complex surveys, 434-435, 445-446, 
464 
confidence intervals, 431, 438-440 
design-based inference, 434-452 
design effects, 435, 444-445 
diagnostics, 449, 465-466 
effects of unequal probabilities, 435 
estimating coefficients, 437, 442 
graphs, 433-434, 437-438 
model-based inference, 430-434, 
447-455 
purposes of, 447 
software, 444-445 
straight-line model, 430-441 
variance, 395, 438-439, 442 
variance estimation, 395, 439-442 
Regression estimation, 138-142, 
151-155, 457-460 
bias, 138-139 
estimating means, 138-142 
estimating totals, 138-142, 457-460 
generalized, 118, 457-460 
mean squared error, 139 
model-based inference, 151-152 
reasons for use, 138 
two-phase sampling, 479-480 
variance, 138-139, 459 
variance estimation, 139, 459 
Regression imputation, 349-350 
Replicate weights, 377-380, 385 
Replication for variance estimation, 
373-386 
Resampling for variance estimation, 
373-386 


Residuals 
plotting, 147, 150-151, 433, 465 
use in variance estimation, 126, 139, 
368, 459 
Respondent burden, 335-336 
Respondent-driven sampling, 517 
Response propensity, 338 
Response rate, 330-332, 354-356. 
See also Nonresponse 


Sample. See also specific sample design 
cluster, 25-28, 60, 165-280,
281-282 
convenience, 5, 29, 99, 433, 544 
definition of, 3 
judgment, 5, 9 
probability, 2, 25-33, 60 
purposive, 5, 9, 263 
quota, 96-99 
representative, 2-3 
self-selected, 2, 24, 99 
self-weighting, 40, 60, 172, 179, 
287-288 
simple random, 25-27, 33-59 
stratified, 25-27, 60, 77-101, 282 
systematic, 25-27, 50-51, 60, 
196-199, 226-227 
Sampled population, 3-4 
Sample size, 44, 46-50 
accuracy and, 8 
cluster sampling, 195-196 
complex surveys, 311-312 
decision-theoretic approach, 67 
design effect and, 311-312 
importance of, 49-50 
simple random sampling, 46-50, 59 
stratified sampling, 91 
Sampling, advantages of, 16-18 
Sampling distribution, 29-30, 60, 129, 
552 
Sampling error, 16-18, 527 
Sampling fraction, 36 
Sampling frame, 3, 5, 18, 529-533 
Sampling unit, 3 
Sampling weight. See Weights 
Scatterplots with survey data, 
304-309 
Secondary sampling unit, 165, 207 
Second-order correction to chi-square 
test, 415 
Selection bias, 5-9, 16, 31, 50, 222 
Self-representing psu, 312 
Self-selected sample, 2, 24, 99 
Self-weighting sample, 40, 60, 85, 172, 
179, 287-288 
advantages of, 172 
complex surveys, 287-288, 290, 
291, 294 
Sensitive questions, 540-542 
Sen-Yates-Grundy variance, 241 
Separate ratio estimator, 144, 284 
Sequential sampling, 517-518 
Simple random sampling, 25-27,
33-59, 550-552 
cluster sampling compared with, 27, 39, 
165-166, 173-176, 224 
confidence intervals, 39-45 
design, 46-50 
design-based inference, 51-54 
design effect and, 309 
estimating means, 37 
estimating proportions, 38 
estimating totals, 37 
model-based inference, 54-57 
notation for, 34-35 
reasons for use, 58 
sample size, 44, 46-50, 57 
selection of, 34, 65-66, 68 
stratified sampling compared with, 27, 
58, 74 
systematic sampling compared with, 
27, 50-51, 196 
variance, 36-37, 52-53 
variance estimation, 36-37 
with replacement (SRSWR), 33, 60, 
550-551 
without replacement (SRS), 33, 551-552 
Small area estimation, 518-522 
Snowball sampling, 517, 530 
Software, ix-x, 287, 393, 444-445
Standard deviation, 33 
Standard error, 36. See also Variance 
Stochastic regression imputation, 349 
Strata, 26, 74, 100 
Stratification variable, choice of, 91-95 
Stratified random sampling, 77-101 
allocating observations to strata, 85-91 
cluster sampling compared with, 
166-168 
confidence intervals, 79-80 
defining strata, 91-95 
design, 85-95 
design effects, 310 
estimating means, 78 
estimating proportions, 80-81 
estimating totals, 78 
model-based inference, 95-96 
notation for, 77-78 
reasons for use, 74-77, 513 
sample size, 91 
simple random sampling compared with, 
27, 58, 74 
variance, 74, 79 
variance estimation, 79, 81 
weights in, 82-84 
Stratified sampling, 25-27, 58, 60, 73-100, 
197, 282, 513 
chi-square tests and, 409-410 
complex survey component, 282 
rare events, 513 
two-phase sampling and, 483-484 
Subdomains. See Domains 
Subsidiary variable, 118 
Substitution, for nonrespondents, 6 
Successive conditioning, 254, 560

Superpopulation, 41-42 

Survey of Youth in Custody, 298-303, 
371-373, 383-384, 414-415 

SURVEY program, xi, 72 

Synthetic estimator, 520 

Systematic sampling, 25-27, 50-51, 60, 
196-199, 226-227, 244-245 


Tag-recapture estimation. See 
Capture-recapture estimation 
Target population, 3-5 
Taylor series, 366-369 
Telephone surveys, 3, 249-250, 266, 531 
cellular telephones, 531 
random digit dialing, 249, 531 
response in, 335, 535 
Telescoping of responses, 10 
3-P sampling, 251-252 
Three-stage cluster sampling, 262, 278, 
286 
Total, population, 30 
Total survey design, 528-529 
Total survey quality, 543-545 
Two-phase sampling, 336-338, 469-495 
design, 482-486 
generalized regression estimation, 
479-480 
jackknife, 481-482 
for nonresponse, 336-338 
for rare events, 513-514 
for ratio estimation, 477-479, 485 
for stratification, 473-477, 483-484 
used to estimate disease prevalence, 
477, 491 
Two-stage cluster sampling. See Cluster 
sampling 


Unbiased estimator 
design-based, 31-32, 51-54 
model-based, 55 
Undercount, in U.S. census, 5, 500, 
505-506 
Undercoverage, 5, 529 
Unequal-probability sampling, 219-280, 
514 
complex surveys and, 281-282
design, 231-234 


design-based inference, 
254-262 
estimating means, 230, 234, 247 
estimating totals, 223, 229, 234, 241, 
247 
examples of, 221, 249-254 
model-based inference, 262-264
one psu, 221-225
one-stage, 225-235, 238-245 
reasons for use, 220-221, 514 
with replacement, 234-238 
without replacement, 238-248 
selecting psus, 225-228, 244-245 
simple random sampling compared 
with, 224 
stratified sampling compared with, 
220-221 
two-stage, 235-238, 245-246 
variance, 224, 229, 241, 245 
variance estimation, 229-230, 
241-247 
weights, 234-235, 246-248 
Unit 
observation, 3 
primary sampling (psu), 165 
sampling, 3 
secondary sampling (ssu), 165 
Unit nonresponse, 329. See also 
Nonresponse 
Universe, 28 


Variance, 31, 553 

cluster sampling, 170-171, 
179-180, 184, 228-229, 241, 
245, 255-257 

complex surveys, 282, 309-310, 
365-369 

model-based, 95-96 

population, 42, 46, 293-294 

random variable, 552-556 

ratio estimation, 125-126, 368-369

regression coefficients, 395, 439, 442 

regression estimation, 139, 459 

sample, 36 

sampling distribution, 31 

simple random sampling, 36, 51-54 
stratified sampling, 74, 79
unequal-probability sampling, 
224, 229, 241, 245 


Variance estimation 


cluster sampling, 170-171, 179-180, 
185, 228-230, 235-236, 241, 243, 
245, 257-259 

complex surveys, 365-393 

insufficiency of weights for, 84, 286, 293, 
365 

ratio estimation, 125-126, 368-369 

regression coefficients, 395, 439, 

442 

regression estimation, 139, 459 

replication methods, 373-386 

simple random sampling, 36 

software, 393 

stratified sampling, 79, 81 

unequal-probability sampling, 229-230,
241-247 


Wald test, 411-414 
Weighted least squares, 147-148, 
Weighting-class adjustments for
nonresponse, 340-348
Weights, 39-40, 285-288, 294-309,
340-346, 443-444, 447-452
cluster sampling, 170, 172, 
179-180, 184, 186, 189, 223, 
234-236, 246 
complex surveys, 285-288 
contingency tables, 408 
epmf and, 290-294 
graphs and, 294-309 
insufficiency for variance 
estimation, 84, 286, 293, 365 
model-based analysis and, 288, 443-444, 
447-452 
nonresponse adjustments, 
314, 340-346, 362 
regression and, 443-444, 
447-452 
stratified sampling, 82-84, 285 
truncation of, 286, 342 
unequal-probability sampling, 234-235,
246-248 


